CN107086952A - Bayesian spam filtering method based on TF-IDF Chinese word segmentation - Google Patents

Bayesian spam filtering method based on TF-IDF Chinese word segmentation

Info

Publication number
CN107086952A
Authority
CN
China
Prior art keywords
chinese
word
feature
idf
feature words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710257123.7A
Other languages
Chinese (zh)
Inventor
崔玉文
石乐义
刘晓彤
陈鸿龙
郭宏斌
孙慧
薛智宇
李剑蓝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Petroleum East China
Original Assignee
China University of Petroleum East China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Petroleum East China
Priority to CN201710257123.7A
Publication of CN107086952A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/42 Mailbox-related aspects, e.g. synchronisation of mailboxes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/21 Monitoring or handling of messages
    • H04L51/212 Monitoring or handling of messages using filtering or selective blocking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a Bayesian spam filtering method based on TF-IDF Chinese word segmentation. The method comprises: building a Chinese e-mail training text set; performing TF-IDF Chinese word segmentation on the training text set according to a stop-word dictionary, and updating the stop-word dictionary; extracting feature words from the training text set with the TF-IDF Chinese word segmentation algorithm, and updating a feature word lexicon with the extracted feature words and their weights; inputting the feature words and weights obtained after segmentation to a Bayes filter to classify the mail; and feeding the classification results back to a log library. The invention claims a low false-positive rate and high execution efficiency for Chinese spam filtering.

Description

Bayesian spam filtering method based on TF-IDF Chinese word segmentation
Technical field
The present invention relates to a Bayesian spam filtering method based on TF-IDF Chinese word segmentation, and in particular to spam filtering of Chinese e-mail: the content of a Chinese e-mail is segmented by a TF-IDF Chinese word segmentation algorithm, feature words are extracted and their weights calculated, and the feature words and weights are then input to a Bayes classifier, which judges whether the mail is spam and thereby filters it.
Background art
The network has become an indispensable part of human life in today's society. The rapid development of network technology has greatly changed how people live and work, and has greatly improved both quality of life and working efficiency. In recent years e-mail, as an emerging communication technology, has replaced communication modes such as traditional letters that waste manpower, material and financial resources, making communication, study and work simple and efficient. However, while e-mail brings convenience, it also allows individuals or enterprises seeking to profit to harass e-mail users by sending large volumes of illegitimate mail. The proliferation of spam has had a strongly negative effect on the life and work of e-mail users. If a user's mailbox is flooded with spam, e-mail not only fails to make the user's study and work more efficient, but causes the user to waste a great deal of time and effort handling spam. In the face of growing spam, a reliable and effective spam filter has become a necessity.
The Bayesian algorithm is widely used in spam filtering because it is efficient, easy to implement and scalable. In addition, by training on mail samples it can automatically learn the sample content and use it to filter spam. In existing spam filtering, the Bayesian algorithm has performed very well. In particular, for English e-mail classification the accuracy of relatively simple Bayesian spam filters has exceeded 99%. For Chinese e-mail, however, the false-positive rate of spam judgment and filtering has remained high because of the particular characteristics of the Chinese language. If the mail content can be accurately segmented into words before a Chinese e-mail is classified, the false-positive rate of Chinese e-mail classification will be substantially reduced.
The TF-IDF (Term Frequency-Inverse Document Frequency) segmentation algorithm consists of two parts: TF (term frequency) and IDF (inverse document frequency). Term frequency (TF) is the number of times a feature word occurs in the selected document; computing it therefore requires first dividing the text into words and then counting them. Inverse document frequency (IDF) is a measure of the general importance of a feature word; it is estimated by counting, over the established corpus, how widely the feature word occurs. IDF effectively lowers the weight of high-frequency feature words that contribute little, weakening their influence on text classification, and assigns a higher weight to feature words that occur less often but contribute more, which improves the accuracy of text classification.
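For orientation, the two quantities described above are commonly written as follows. This is the textbook form and is given here only as an assumption; the symbols n_{t,d} (count of word t in document d), N (number of documents in the corpus) and df(t) (number of documents containing t) are introduced for illustration and are not taken from the patent's own formulas.

```latex
\mathrm{TF}(t, d) = n_{t,d}, \qquad
\mathrm{IDF}(t) = \log \frac{N}{\mathrm{df}(t)}, \qquad
w(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)
```

A word that appears in every document has df(t) = N and therefore IDF(t) = 0, which is the mechanism by which common, uninformative words are down-weighted.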
Summary of the invention
To reduce the false-positive rate of spam filtering for Chinese e-mail and improve accuracy, the present invention builds on the naive Bayes spam filtering method and introduces a TF-IDF Chinese word segmentation algorithm to extract feature words accurately from the mail content and to evaluate their weights, yielding an efficient spam filtering method for Chinese content.
To achieve this, a Bayesian spam filtering method based on TF-IDF Chinese word segmentation is proposed, which mainly comprises the following steps:
(1) collect a Chinese e-mail training sample set, including spam and legitimate mail, and build a Chinese e-mail training text set;
(2) perform TF-IDF Chinese word segmentation on the training text set according to a stop-word dictionary, and update the stop-word dictionary;
(3) extract feature words from the spam and legitimate mail in the training text set with the TF-IDF Chinese word segmentation algorithm, and update the feature word lexicon with the extracted feature words and their weights;
(4) input the feature words and feature word weights obtained after TF-IDF Chinese word segmentation to the Bayes filter;
(5) the Bayes classifier judges whether the mail is spam according to the feature words and weights in the input Chinese e-mail content, and feeds the result back to the log library.
In step (2), while the Chinese e-mail content is being segmented, the Chinese Academy of Sciences ICTCLAS Chinese word segmentation plug-in and the stop-word dictionary are called to filter the stop words out of the content, so that feature words are extracted accurately; new stop words that appear in the content are added to the stop-word dictionary.
In step (3), during feature word extraction for a Chinese e-mail, the feature words extracted from the mail content by TF-IDF Chinese word segmentation, together with their computed weights, are compared with the feature words in the feature word lexicon. If an identical feature word exists, its weight in the lexicon is updated; if not, the new feature word and its weight are added to the lexicon.
In step (4), the feature words and weights produced by TF-IDF Chinese word segmentation of the Chinese e-mail training set or of a new mail are input to the Bayes classifier. The probability that the e-mail is spam is calculated from the input feature words and the established feature word lexicon. When this probability exceeds the set threshold, the e-mail is judged to be spam; otherwise it is judged to be legitimate mail.
In step (5), to reduce the effect of noise feature words on classification accuracy, conditional feedback is established after the Bayes classifier classifies an e-mail: the content and classification result of the e-mail are fed back to the log library, and the log library is later used as a training sample set for further training.
As the above technical solution shows, compared with existing Bayesian spam filtering methods for Chinese e-mail, the present invention combines the TF-IDF Chinese word segmentation algorithm with the Bayesian classification algorithm. Feature words are extracted accurately, directly and automatically from Chinese mail content by the TF-IDF Chinese word segmentation algorithm, so spam feature words need not be collected manually to build the feature dictionary. This avoids the inaccuracy caused by subjectivity in manual processing and improves the accuracy of spam filtering.
In addition, e-mails classified by the Bayes classifier are fed back to the log library. The mail types and mail contents recorded in the log library are periodically used to build a new training set automatically, from which the key feature words and weights in the feature word lexicon are reconstructed. The spam classification rules are thereby updated automatically, improving the reliability and accuracy of spam filtering.
Brief description of the drawings
To explain the technical solution of the embodiments more clearly, the invention is further described below with reference to the drawings and specific embodiments:
Fig. 1 is a flow chart of the disclosed Bayesian spam filtering method based on TF-IDF Chinese word segmentation;
Fig. 2 is a flow chart of the disclosed TF-IDF Chinese e-mail word segmentation;
Fig. 3 is a diagram of the feedback process of the disclosed Bayesian spam filtering method based on TF-IDF Chinese word segmentation.
Detailed description of the embodiments
Fig. 1 shows the flow chart of the Bayesian spam filtering method based on TF-IDF Chinese word segmentation.
Step (1): Collect a Chinese e-mail training sample set, including spam and legitimate mail, and build a Chinese e-mail training text set.
The Chinese e-mail training sample set of step (1) is a collection of a certain number of spam and legitimate mails. Spam filtering judges whether a mail is spam according to how particular text is expressed in the mail content, and then filters it. In spam classification based on a Bayes classifier, a certain number of mails must first be collected to build the training sample set. A feature library is built from the training sample set, and the probability that a mail belongs to a given class is then computed from statistics on how the mail's features appear in the feature library, which classifies the mail. For example, suppose there is a mail training sample set M={m1, m2..., mn}. The set of texts in the training samples that can indicate their class is assumed to be W={w1, w2..., wn}, and the content classes of the mail text set are denoted C={c1, c2..., cn}. M={m1, m2..., mn} is then the feature vector of the text Mq to be classified. Following the Bayes classifier's procedure for classifying text content, let P={p1, p2..., pn} denote the probabilities that W={w1, w2..., wn} belongs to the particular classes C={c1, c2..., cn}.
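As a minimal sketch of how the probabilities P could be estimated from a labelled training set, the code below computes, for one class, the fraction of mails that contain each feature word. The sample mails, the labels "spam" and "good", and the function name `estimate_word_probabilities` are illustrative assumptions; the mails are assumed to be already segmented into words.

```python
from collections import Counter

# Hypothetical, already-segmented training mails: (tokens, label).
training_set = [
    (["免费", "发票", "优惠"], "spam"),
    (["发票", "代开", "免费"], "spam"),
    (["会议", "通知", "周一"], "good"),
    (["项目", "会议", "纪要"], "good"),
]

def estimate_word_probabilities(samples, label):
    """Return {word: fraction of mails of class `label` containing the word}."""
    mails = [set(tokens) for tokens, lab in samples if lab == label]
    if not mails:
        return {}
    counts = Counter()
    for words in mails:
        counts.update(words)  # each mail counts a word at most once
    return {word: count / len(mails) for word, count in counts.items()}

p_spam = estimate_word_probabilities(training_set, "spam")
p_good = estimate_word_probabilities(training_set, "good")
```

Counting each word once per mail, rather than every occurrence, is one possible design choice; a count-based (multinomial) estimate would be an equally reasonable alternative.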
Step (2): Perform TF-IDF Chinese word segmentation on the Chinese e-mail training text set according to the stop-word dictionary, and update the stop-word dictionary.
In step (2), while the Chinese e-mail content is being segmented, the Chinese Academy of Sciences ICTCLAS Chinese word segmentation plug-in and the stop-word dictionary are called to filter the stop words out of the content, so that feature words are extracted accurately; new stop words that appear in the content are added to the stop-word dictionary.
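The stop-word handling in this step can be sketched as below. The sketch assumes the mail has already been segmented into tokens (the ICTCLAS call itself is not shown), and the helper names `filter_stop_words` and `update_stop_words` as well as the sample words are hypothetical. How new stop words are identified is not specified in the text, so they are simply passed in.

```python
def filter_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop-word dictionary."""
    return [token for token in tokens if token not in stop_words]

def update_stop_words(stop_words, new_words):
    """Add newly identified stop words in place; return the ones actually added."""
    added = set(new_words) - stop_words
    stop_words |= added
    return added

stop_words = {"的", "了", "在"}
tokens = ["我们", "的", "会议", "在", "周一", "举行", "了"]
features = filter_stop_words(tokens, stop_words)
newly_added = update_stop_words(stop_words, ["我们", "的"])
```

Holding the dictionary in a set keeps each membership test constant-time on average, which matters when every token of every mail is checked.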
When TF-IDF Chinese word segmentation is performed on the Chinese e-mail text set, the stop-word dictionary data and the training text set data must first be entered into the TF-IDF Chinese word segmentation module; see the TF-IDF Chinese e-mail word segmentation flow chart in Fig. 2. In this module, the Chinese Academy of Sciences ICTCLAS Chinese word segmentation plug-in and the established stop-word dictionary are called to filter out useless words, such as function words and prepositions, from the segmented mail content. The keywords that remain after filtering are the feature words used to judge spam, and the TF-IDF algorithm then computes the weight each feature word carries in that judgment. The main steps of the feature word weight calculation are as follows:
Assume that the text set built from the received mails is D and that a feature word in the mail content is t. After Chinese word segmentation, the first quantity calculated is the frequency of the feature word in a given mail text:
Here, TF(t, D) is the frequency of feature word t in text set D, and AvgTF(D) is the average frequency of all feature words in the text set.
In formula (2), the more documents of the text set in which a particular feature word occurs, the larger the denominator, and hence the smaller the resulting IDF (inverse document frequency).
Weight_TF-IDF = TF(t, D) × INFT(t, D) (formula 4)
Let N denote the total number of documents in the document set, t a feature word occurring in a document, and n the total number of feature words of class i; the TF-IDF normalization formula is then:
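The weighting formulas themselves are not reproduced in this text, so the following is only a sketch that substitutes the textbook TF-IDF definitions for them: tf is the word count divided by the document length, and idf is log(N / df). It is not the patent's formula (1) to (5); in particular it does not use AvgTF(D) or the normalization described above. The function name `tf_idf_weights` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(documents):
    """Return one {word: weight} dict per document.

    Textbook form (an assumption, not the patent's formulas):
    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

# Hypothetical segmented mails with stop words already removed.
docs = [
    ["免费", "发票", "免费"],
    ["会议", "通知", "发票"],
    ["会议", "纪要"],
]
weights = tf_idf_weights(docs)
```

In the first document "免费" occurs twice and appears in no other document, so it receives a higher weight than "发票", which also appears in the second document.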
Step (3): Extract feature words from the spam and legitimate mail in the Chinese e-mail training text set with the TF-IDF Chinese word segmentation algorithm, and update the feature word lexicon and dictionary with the extracted feature words and their weights.
In step (3), during feature word extraction for a Chinese e-mail, the feature words extracted from the mail content by TF-IDF Chinese word segmentation, together with their computed weights, are compared with the feature words in the feature word lexicon. If an identical feature word exists, its weight in the lexicon is updated; if not, the new feature word and its weight are added to the lexicon.
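The compare-then-update logic of this step can be sketched with a plain dictionary. The text does not say how an existing weight is updated, so overwriting it with the newly computed weight is an assumption, as are the function name `update_feature_lexicon` and the sample weights.

```python
def update_feature_lexicon(lexicon, extracted):
    """Merge extracted {word: weight} pairs into the lexicon in place.

    Existing words get their weight overwritten (an assumed update rule);
    unseen words are added. Returns (updated_words, added_words).
    """
    updated, added = [], []
    for word, weight in extracted.items():
        if word in lexicon:
            updated.append(word)
        else:
            added.append(word)
        lexicon[word] = weight
    return updated, added

lexicon = {"免费": 0.40, "发票": 0.25}
updated, added = update_feature_lexicon(lexicon, {"免费": 0.55, "代开": 0.30})
```

Other update rules, such as averaging the old and new weights, would fit the description equally well.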
Step (4): Input the feature words and feature word weights obtained after TF-IDF Chinese word segmentation to the Bayes filter.
In step (4), the feature words and weights produced by TF-IDF Chinese word segmentation of the Chinese e-mail training set or of a new mail are input to the Bayes classifier. The probability that the e-mail is spam is calculated from the input feature words and the established feature word lexicon. When this probability exceeds the set threshold, the e-mail is judged to be spam; otherwise it is judged to be legitimate mail.
Assume the input mail is represented by the feature set M={f1, f2..., fn} and that the mail classes are C={good, spam}. The feature words are assumed to be completely independent of one another, and the probability calculation for the feature words can be described as:
P(F) is the probability of the features occurring in any message, and the joint probability of the n features is taken as the product of their individual probabilities. The probability that a mail with these features is spam equals the probability that any mail is spam, multiplied by the probability of the features occurring jointly in spam, divided by the probability of observing the features in any message. The computed value alone is not sufficient to decide whether an e-mail received in the mailbox is spam; a decision criterion, the so-called threshold, is also needed. Comparing the value computed by formula (6) with the threshold t determines whether the received mail is spam. If P(C=spam | F) > t, that is, the computed value exceeds the threshold, the received e-mail is judged to be spam. Otherwise, the mail does not exceed the threshold, is provisionally judged to be legitimate mail, and is not filtered.
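A minimal sketch of the threshold decision follows. It implements a plain naive Bayes posterior with add-one smoothing, computed in log space, and lets only the words present in the mail contribute. It does not use the TF-IDF weights, because the text does not specify how they enter the probability; that omission, the smoothing, the threshold value 0.8 and all names and counts below are assumptions.

```python
import math

def spam_probability(tokens, spam_counts, good_counts, n_spam, n_good):
    """P(C=spam | F) under the naive independence assumption.

    spam_counts / good_counts map a word to the number of training mails
    of that class containing it; n_spam / n_good are the class sizes.
    """
    log_spam = math.log(n_spam / (n_spam + n_good))
    log_good = math.log(n_good / (n_spam + n_good))
    for word in set(tokens):
        # Add-one smoothing keeps unseen words from zeroing the product.
        log_spam += math.log((spam_counts.get(word, 0) + 1) / (n_spam + 2))
        log_good += math.log((good_counts.get(word, 0) + 1) / (n_good + 2))
    diff = log_good - log_spam
    if diff > 700:  # guard against overflow in exp()
        return 0.0
    return 1.0 / (1.0 + math.exp(diff))

def classify(tokens, spam_counts, good_counts, n_spam, n_good, threshold=0.8):
    """Return "spam" when the posterior exceeds the threshold t, else "good"."""
    p = spam_probability(tokens, spam_counts, good_counts, n_spam, n_good)
    return "spam" if p > threshold else "good"

# Hypothetical statistics from a training set of 2 spam and 2 good mails.
spam_counts = {"免费": 2, "发票": 2, "代开": 1}
good_counts = {"会议": 2, "通知": 1}
p = spam_probability(["免费", "发票"], spam_counts, good_counts, 2, 2)
label = classify(["免费", "发票"], spam_counts, good_counts, 2, 2)
```

Raising the threshold trades missed spam for fewer legitimate mails wrongly filtered, which is usually the more important error to avoid.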
Step (5): The Bayes classifier judges whether the mail is spam according to the feature words and weights in the input Chinese e-mail content, and feeds the result back to the log library.
In step (5), to reduce the effect of noise feature words on classification accuracy, conditional feedback is established after the Bayes classifier classifies an e-mail: the content and classification result of the e-mail are fed back to the log library, and the log library is later used as a training sample set for further training.
When classifying mail samples, the Bayes classifier calculates the posterior probability from the prior probabilities obtained by counting, and then decides the class of the mail; see Fig. 3, the feedback process of the Bayesian spam filtering method based on TF-IDF Chinese word segmentation. After the Bayes classifier classifies a mail, the feedback mechanism can effectively limit the misclassification caused by noise feature words. From the mail contents and classification results fed back to the log library, a new mail training text set can be built periodically. Training on this new text set effectively addresses the high classification error rate that traditional naive Bayes spam filtering suffers when feature combinations keep changing.
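The feedback loop can be sketched as a small log that stores each classified mail and hands the stored records back as a new training batch. The text only says that retraining happens periodically; triggering it after a fixed number of records, and the names `FeedbackLog`, `record` and `retrain_every`, are assumptions made for illustration.

```python
class FeedbackLog:
    """Stores classified mails and returns them in batches for retraining."""

    def __init__(self, retrain_every=3):
        self.retrain_every = retrain_every  # assumed count-based trigger
        self.records = []

    def record(self, tokens, label):
        """Log one classification result.

        Returns the accumulated batch (and clears the log) once
        `retrain_every` records are stored; otherwise returns None.
        """
        self.records.append((list(tokens), label))
        if len(self.records) >= self.retrain_every:
            batch, self.records = self.records, []
            return batch
        return None

log = FeedbackLog(retrain_every=2)
first = log.record(["免费", "发票"], "spam")
second = log.record(["会议", "通知"], "good")
```

A returned batch has the same (tokens, label) shape as an initial training set, so it can be passed to whatever routine rebuilds the feature word lexicon.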
The basic steps of the Bayesian spam filtering method based on TF-IDF Chinese word segmentation have been described in detail above. By applying the TF-IDF Chinese word segmentation algorithm to Bayesian spam filtering, the method aims to segment Chinese e-mail content accurately and so to resolve the high error rate that Chinese word segmentation causes in Bayesian spam filtering. In addition, a feedback mechanism is added to the Bayes classifier's spam classification process, which can effectively address filtering failures or poor accuracy caused by constantly changing features.

Claims (5)

1. A Bayesian spam filtering method based on TF-IDF Chinese word segmentation, characterized in that it mainly comprises the following steps:
(1) collect a Chinese e-mail training sample set, including spam and legitimate mail, and build a Chinese e-mail training text set;
(2) perform TF-IDF Chinese word segmentation on the training text set according to a stop-word dictionary, and update the stop-word dictionary;
(3) extract feature words from the spam and legitimate mail in the training text set with the TF-IDF Chinese word segmentation algorithm, and update the feature word lexicon with the extracted feature words and their weights;
(4) input the feature words and feature word weights obtained after TF-IDF Chinese word segmentation to the Bayes filter;
(5) the Bayes classifier judges whether the mail is spam according to the feature words and weights in the input Chinese e-mail content, and feeds the result back to the log library.
2. The Bayesian spam filtering method based on TF-IDF Chinese word segmentation according to claim 1, characterized in that: in step (2), while the Chinese e-mail content is being segmented, the Chinese Academy of Sciences ICTCLAS Chinese word segmentation plug-in and the stop-word dictionary are called to filter the stop words out of the content, so that feature words are extracted accurately, and new stop words that appear in the content are added to the stop-word dictionary.
3. The Bayesian spam filtering method based on TF-IDF Chinese word segmentation according to claim 1 and claim 2, characterized in that: in step (3), during feature word extraction for a Chinese e-mail, the feature words extracted from the mail content by TF-IDF Chinese word segmentation, together with their computed weights, are compared with the feature words in the feature word lexicon; if an identical feature word exists, its weight in the lexicon is updated; if not, the new feature word and its weight are added to the lexicon.
4. The Bayesian spam filtering method based on TF-IDF Chinese word segmentation according to claim 1, characterized in that: in step (4), the feature words and weights produced by TF-IDF Chinese word segmentation of the Chinese e-mail training set or of a new mail are input to the Bayes classifier; the probability that the e-mail is spam is calculated from the input feature words and the established feature word lexicon; when this probability exceeds the set threshold, the e-mail is judged to be spam, and otherwise it is judged to be legitimate mail.
5. The Bayesian spam filtering method based on TF-IDF Chinese word segmentation according to claim 1, characterized in that: in step (5), to reduce the effect of noise feature words on classification accuracy, conditional feedback is established after the Bayes classifier classifies an e-mail, the content and classification result of the e-mail are fed back to the log library, and the log library is later used as a training sample set for further training.
CN201710257123.7A 2017-04-19 2017-04-19 A kind of Bayesian SPAM Filtering method based on TF IDF Chinese word segmentations Pending CN107086952A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710257123.7A CN107086952A (en) 2017-04-19 2017-04-19 A kind of Bayesian SPAM Filtering method based on TF IDF Chinese word segmentations

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710257123.7A CN107086952A (en) 2017-04-19 2017-04-19 A kind of Bayesian SPAM Filtering method based on TF IDF Chinese word segmentations

Publications (1)

Publication Number Publication Date
CN107086952A true CN107086952A (en) 2017-08-22

Family

ID=59612833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710257123.7A Pending CN107086952A (en) 2017-04-19 2017-04-19 A kind of Bayesian SPAM Filtering method based on TF IDF Chinese word segmentations

Country Status (1)

Country Link
CN (1) CN107086952A (en)

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108199953A (en) * 2018-01-31 2018-06-22 湖北工业大学 A kind of spam filtering method and system
CN108287860A (en) * 2017-09-05 2018-07-17 腾讯科技(深圳)有限公司 Model generating method, garbage files recognition methods and device
CN108427775A (en) * 2018-06-04 2018-08-21 成都市大匠通科技有限公司 A kind of project cost inventory sorting technique based on multinomial Bayes
CN108491390A (en) * 2018-03-28 2018-09-04 江苏满运软件科技有限公司 A kind of main line logistics goods title automatic recognition classification method
CN108804651A (en) * 2018-06-07 2018-11-13 南京邮电大学 A kind of Social behaviors detection method based on reinforcing Bayes's classification
CN108830108A (en) * 2018-06-04 2018-11-16 成都知道创宇信息技术有限公司 A kind of web page contents altering detecting method based on NB Algorithm
CN108985721A (en) * 2018-07-12 2018-12-11 燕山大学 A kind of process for sorting mailings and system
CN109191354A (en) * 2018-08-21 2019-01-11 安徽讯飞智能科技有限公司 A kind of whole people society pipe task distribution method based on natural language processing
CN110149268A (en) * 2019-05-15 2019-08-20 深圳市趣创科技有限公司 A kind of method and its system of automatic fitration spam
CN110300054A (en) * 2019-07-03 2019-10-01 论客科技(广州)有限公司 The recognition methods of malice fishing mail and device
CN110505144A (en) * 2019-08-09 2019-11-26 世纪龙信息网络有限责任公司 Process for sorting mailings, device, equipment and storage medium
CN111079427A (en) * 2019-12-20 2020-04-28 北京金睛云华科技有限公司 Junk mail identification method and system
CN111651598A (en) * 2020-05-28 2020-09-11 上海勃池信息技术有限公司 Spam text auditing device and method through center vector similarity matching
CN112215002A (en) * 2020-11-02 2021-01-12 浙江大学 Electric power system text data classification method based on improved naive Bayes
CN112699242A (en) * 2021-01-11 2021-04-23 大连东软信息学院 Method for identifying Chinese text author
CN116016416A (en) * 2023-03-24 2023-04-25 深圳市明源云科技有限公司 Junk mail identification method, device, equipment and computer readable storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1889108A (en) * 2005-06-29 2007-01-03 腾讯科技(深圳)有限公司 Method of identifying junk mail
US20080301809A1 (en) * 2007-05-31 2008-12-04 Nortel Networks System and method for detectng malicious mail from spam zombies
CN101996241A (en) * 2010-10-22 2011-03-30 东南大学 Bayesian algorithm-based content filtering method
CN103744905A (en) * 2013-12-25 2014-04-23 新浪网技术(中国)有限公司 Junk mail judgment method and device
CN104731772A (en) * 2015-04-14 2015-06-24 辽宁大学 Improved feature evaluation function based Bayesian spam filtering method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
陈琦等: "基于TF*IDF的垃圾邮件过滤特征选择改进算法", 《计算机应用研究》 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108287860A (en) * 2017-09-05 2018-07-17 腾讯科技(深圳)有限公司 Model generating method, garbage files recognition methods and device
CN108199953B (en) * 2018-01-31 2020-09-29 湖北工业大学 Junk mail identification method and system
CN108199953A (en) * 2018-01-31 2018-06-22 湖北工业大学 A kind of spam filtering method and system
CN108491390A (en) * 2018-03-28 2018-09-04 江苏满运软件科技有限公司 A kind of main line logistics goods title automatic recognition classification method
CN108427775A (en) * 2018-06-04 2018-08-21 成都市大匠通科技有限公司 A kind of project cost inventory sorting technique based on multinomial Bayes
CN108830108A (en) * 2018-06-04 2018-11-16 成都知道创宇信息技术有限公司 A kind of web page contents altering detecting method based on NB Algorithm
CN108804651A (en) * 2018-06-07 2018-11-13 南京邮电大学 A kind of Social behaviors detection method based on reinforcing Bayes's classification
CN108804651B (en) * 2018-06-07 2022-08-19 南京邮电大学 Social behavior detection method based on enhanced Bayesian classification
CN108985721A (en) * 2018-07-12 2018-12-11 燕山大学 A kind of process for sorting mailings and system
CN108985721B (en) * 2018-07-12 2020-10-02 燕山大学 Mail classification method and system
CN109191354A (en) * 2018-08-21 2019-01-11 安徽讯飞智能科技有限公司 A kind of whole people society pipe task distribution method based on natural language processing
CN110149268A (en) * 2019-05-15 2019-08-20 深圳市趣创科技有限公司 A kind of method and its system of automatic fitration spam
CN110300054A (en) * 2019-07-03 2019-10-01 论客科技(广州)有限公司 The recognition methods of malice fishing mail and device
CN110505144A (en) * 2019-08-09 2019-11-26 世纪龙信息网络有限责任公司 Process for sorting mailings, device, equipment and storage medium
CN111079427A (en) * 2019-12-20 2020-04-28 北京金睛云华科技有限公司 Junk mail identification method and system
CN111651598A (en) * 2020-05-28 2020-09-11 上海勃池信息技术有限公司 Spam text auditing device and method through center vector similarity matching
CN112215002A (en) * 2020-11-02 2021-01-12 浙江大学 Electric power system text data classification method based on improved naive Bayes
CN112699242A (en) * 2021-01-11 2021-04-23 大连东软信息学院 Method for identifying Chinese text author
CN116016416A (en) * 2023-03-24 2023-04-25 深圳市明源云科技有限公司 Junk mail identification method, device, equipment and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN107086952A (en) A kind of Bayesian SPAM Filtering method based on TF IDF Chinese word segmentations
CN106453033B (en) Multi-level process for sorting mailings based on Mail Contents
CN103514174B (en) A kind of file classification method and device
CN104463552B (en) Calendar reminding generation method and device
Faguo et al. Research on short text classification algorithm based on statistics and rules
CN103024746A (en) System and method for processing spam short messages for telecommunication operator
CN103984703B (en) Mail classification method and device
CN104050556B (en) The feature selection approach and its detection method of a kind of spam
Christina et al. Email spam filtering using supervised machine learning techniques
CN101908055B (en) Method for setting information classification threshold for optimizing lam percentage and information filtering system using same
CN110781679B (en) News event keyword mining method based on associated semantic chain network
CN105843851A (en) Analyzing and extracting method and device of cheating mails
CN101295381A (en) Junk mail detecting method
CN104731772B (en) Improved feature evaluation function based Bayesian spam filtering method
CN110149268A (en) A kind of method and its system of automatic fitration spam
CN102945246A (en) Method and device for processing network information data
CN107544961A (en) A kind of sentiment analysis method, equipment and its storage device of social media comment
Deng et al. Research on a naive bayesian based short message filtering system
CN105337842B (en) A kind of rubbish mail filtering method unrelated with content
CN105117466A (en) Internet information screening system and method
Anitha et al. Email spam filtering using machine learning based XGBoost classifier method
CN106230690B (en) A kind of process for sorting mailings and system of combination user property
CN101329668A (en) Method and apparatus for generating information regulation and method and system for judging information types
Mirza et al. Evaluating efficiency of classifier for email spam detector using hybrid feature selection approaches
CN106294542B (en) A kind of letters and calls data mining methods of marking and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170822
