CN107086952A - Bayesian spam filtering method based on TF-IDF Chinese word segmentation - Google Patents
- Publication number
- CN107086952A (application number CN201710257123.7A)
- Authority
- CN
- China
- Prior art keywords
- chinese
- word
- feature
- idf
- feature words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/42—Mailbox-related aspects, e.g. synchronisation of mailboxes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L51/00—User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
- H04L51/21—Monitoring or handling of messages
- H04L51/212—Monitoring or handling of messages using filtering or selective blocking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Probability & Statistics with Applications (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a Bayesian spam filtering method based on TF-IDF Chinese word segmentation. The method comprises: building a Chinese email training text set; performing TF-IDF Chinese word segmentation on the training text set according to a stop-word dictionary, and updating the stop-word dictionary; extracting feature words from the training text set with the TF-IDF Chinese word segmentation algorithm, and updating a feature-word dictionary with the extracted feature words and their weights; inputting the feature words and feature-word weights obtained after TF-IDF Chinese word segmentation into a Bayesian filter to classify the mail; and feeding the classification results back to a log database. In Chinese spam filtering, the invention has a low false-positive rate and high execution efficiency.
Description
Technical Field
The present invention relates to a Bayesian spam filtering method based on TF-IDF Chinese word segmentation. More particularly, when spam is filtered from Chinese email, the content of each Chinese email is segmented by a TF-IDF Chinese word segmentation algorithm, feature words are extracted and their weights are calculated, and the feature words and weights are then input into a Bayes classifier, which judges whether the mail is spam and thereby filters spam.
Background
The network has become an indispensable part of life in today's society. The rapid development of network technology has greatly changed how people live and work, and has greatly improved quality of life and working efficiency. In recent years, email, as an emerging communication technology, has replaced traditional means of communication such as letters, which waste manpower, material and financial resources, and has made communication, study and work simple and efficient. However, while email brings convenience, it also allows individuals or enterprises that send large volumes of illegitimate mail for some gain to harass email users. The flood of spam has a large negative effect on the life and work of email users. If a user's mailbox is flooded with spam, email no longer makes the user's study and work more efficient; on the contrary, the user wastes a great deal of time and effort handling spam. Given the growing nuisance of spam, developing a reliable and effective spam filter has become a necessity.
The Bayesian algorithm is efficient, easy to implement and scalable, and is therefore widely used in spam filtering technology. In addition, a Bayesian algorithm can learn the content of mail samples automatically through training and then filter spam. In existing spam filtering, the Bayesian algorithm has shown excellent results. For English email classification in particular, even fairly simple Bayesian spam filters have reached an accuracy of more than 99%. In spam detection and filtering for Chinese email, however, the false-positive rate has always been high because of the particularities of the Chinese language. If the mail content could be segmented accurately before a Chinese email is classified, the false-positive rate of Chinese email classification would be greatly reduced.
The TF-IDF (Term Frequency-Inverse Document Frequency) segmentation algorithm consists of two parts: TF (term frequency) and IDF (inverse document frequency). Term frequency (TF) is the number of times a feature word occurs in a selected document. Computing it therefore requires the word combinations in the text to be segmented first, after which the words are counted. Inverse document frequency (IDF) is a measure of the general importance of a feature word; it is estimated by counting, over an established corpus, how widely the feature word occurs. IDF effectively lowers the weight of high-frequency feature words that contribute little, weakening their influence on text classification, and assigns a higher weight to feature words that occur less often but contribute more, which improves the accuracy of text classification.
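For orientation, a common textbook formulation of TF-IDF is sketched below. It is not taken from this publication and may differ from the formulas the invention actually uses; the symbols f_{t,d} (number of occurrences of word t in document d), N (number of documents in the corpus) and n_t (number of documents containing t) are notation introduced here for illustration only.

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \ln\frac{N}{n_t}, \qquad
w(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```

As a worked example under these assumptions: a word that occurs 3 times in a 60-word mail has tf = 0.05; if it appears in 10 of 1000 mails, idf = ln(100) ≈ 4.605, so w ≈ 0.230. A word that appears in every mail has idf = ln(1) = 0 and therefore carries no weight, which matches the down-weighting of common words described above.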
Summary of the Invention
To reduce the false-positive rate of spam filtering for Chinese email and improve accuracy, the present invention builds on the naive Bayes spam filtering method and introduces a TF-IDF Chinese word segmentation algorithm to extract feature words accurately from mail content and to evaluate feature-word weights, providing an efficient spam filtering method for Chinese content.
To achieve the above purpose, a Bayesian spam filtering method based on TF-IDF Chinese word segmentation is proposed, which mainly comprises the following steps:
(1) Collect a Chinese email training sample set, including spam and legitimate mail, and build a Chinese email training text set;
(2) Perform TF-IDF Chinese word segmentation on the Chinese email training text set according to a stop-word dictionary, and update the stop-word dictionary;
(3) Extract feature words from the spam and legitimate mail in the Chinese email training text set with the TF-IDF Chinese word segmentation algorithm, and update the feature-word dictionary with the extracted feature words and their weights;
(4) Input the feature words and feature-word weights obtained after TF-IDF Chinese word segmentation into the Bayesian filter;
(5) The Bayes classifier judges whether a mail is spam according to the feature words and feature-word weights in the content of the input Chinese email, and feeds the result back to the log database.
In step (2), while the Chinese email content is being segmented, the ICTCLAS Chinese word segmentation plug-in of the Chinese Academy of Sciences and the stop-word dictionary are called to filter out the stop words in the Chinese email content, so that the feature words of the content are extracted precisely. New stop words that appear in the Chinese email content are added to the stop-word dictionary.
In step (3), during feature-word extraction for a Chinese email, the feature words extracted from the mail content by TF-IDF Chinese word segmentation, together with their computed weights, are compared with the feature words contained in the feature-word dictionary. If an identical feature word exists, its weight in the dictionary is updated; if not, the new feature word and its weight are added to the feature-word dictionary.
In step (4), the feature words and feature-word weights produced by TF-IDF Chinese word segmentation of the Chinese email training set or of a new mail are input into the Bayes classifier. The probability that the email is spam is calculated from the input feature words and the established feature-word dictionary. When this probability exceeds a set threshold, the email is judged to be spam; otherwise it is judged to be legitimate mail.
In step (5), to reduce the influence of noise feature words on classification accuracy, conditional feedback is set up after the Bayes classifier has classified an email: the content and classification result of the email are fed back to the log database, and the log database is later used as a sample training set for training.
As can be seen from the above technical scheme, compared with existing Bayesian spam filtering methods for Chinese email, the present invention combines a TF-IDF Chinese word segmentation algorithm with a Bayesian classification algorithm. The TF-IDF Chinese word segmentation algorithm extracts feature words from Chinese mail content directly, automatically and accurately, so no feature dictionary has to be built by manually collecting spam feature words. This avoids the inaccuracy caused by subjectivity in manual processing and improves the accuracy of spam filtering.
In addition, emails classified by the Bayes classifier can be fed back to the log database. A new training set is built automatically and periodically from the email types and mail contents recorded in the log database, and is used to rebuild the key feature words and their weights in the feature-word dictionary. The spam classification rules are thereby updated automatically, which improves the reliability and accuracy of spam filtering.
Brief Description of the Drawings
To explain the technical scheme of the embodiments of the present invention more clearly, the invention is further described below with reference to the drawings and specific embodiments:
Fig. 1 is a flow chart of the disclosed Bayesian spam filtering method based on TF-IDF Chinese word segmentation;
Fig. 2 is a flow chart of the disclosed TF-IDF Chinese email word segmentation;
Fig. 3 is a diagram of the feedback process of the disclosed Bayesian spam filtering method based on TF-IDF Chinese word segmentation.
Detailed Description
Refer to Fig. 1, the flow chart of the Bayesian spam filtering method based on TF-IDF Chinese word segmentation according to the invention.
Step (1): Collect a Chinese email training sample set, including spam and legitimate mail, and build a Chinese email training text set.
The Chinese email training sample set of step (1) is a collection of a certain number of spam and legitimate mails. Spam filtering judges whether a mail is spam from the way particular text is expressed in its content, and filters it accordingly. In spam classification based on a Bayes classifier, a certain number of mails must first be collected to build a training sample set. A feature database is built from the training sample set, and the probability that a mail belongs to a given class is then computed from how the features of the mail appear in the feature database, which classifies the mail. For example, suppose there is a mail training sample set M = {m1, m2, ..., mn}, and that the set of texts in it that can indicate the class of a mail is W = {w1, w2, ..., wn}. Suppose further that the content classes of the mail text set are C = {c1, c2, ..., cn}. Then M = {m1, m2, ..., mn} is the feature vector of the text Mq to be classified. Following the way a Bayes classifier classifies text content, let P = {p1, p2, ..., pn} denote the probabilities that W = {w1, w2, ..., wn} belong to the particular classes C = {c1, c2, ..., cn}.
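A minimal sketch of how per-word class probabilities could be estimated from a labelled training set is given below. It assumes the mails are already segmented into tokens and uses add-alpha smoothing over per-mail word presence; the function name `word_class_probabilities`, the labels "spam" and "good", the smoothing scheme and the sample mails are all illustrative assumptions, not details taken from the patent.

```python
from collections import Counter

def word_class_probabilities(samples, alpha=1.0):
    """samples: list of (tokens, label) pairs, label in {"spam", "good"}.

    Returns {word: {"spam": P(word | spam), "good": P(word | good)}},
    estimated from how many mails of each class contain the word,
    with add-alpha smoothing (an assumption of this sketch).
    """
    doc_counts = Counter(label for _, label in samples)
    presence = {"spam": Counter(), "good": Counter()}
    for tokens, label in samples:
        presence[label].update(set(tokens))  # count each word once per mail
    vocab = set(presence["spam"]) | set(presence["good"])
    return {
        w: {
            c: (presence[c][w] + alpha) / (doc_counts[c] + 2 * alpha)
            for c in ("spam", "good")
        }
        for w in vocab
    }

# Hypothetical, pre-segmented training mails.
training_set = [
    (["免费", "发票", "优惠"], "spam"),
    (["免费", "中奖"], "spam"),
    (["会议", "通知"], "good"),
    (["会议", "发票"], "good"),
]
probs = word_class_probabilities(training_set)
```

With this toy data, a word seen only in spam ends up with a higher spam probability than a word seen equally often in both classes, which is the kind of per-feature probability P that the classifier consumes.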
Step (2): Perform TF-IDF Chinese word segmentation on the Chinese email training text set according to the stop-word dictionary, and update the stop-word dictionary.
In step (2), while the Chinese email content is being segmented, the ICTCLAS Chinese word segmentation plug-in of the Chinese Academy of Sciences and the stop-word dictionary are called to filter out the stop words in the Chinese email content, so that the feature words of the content are extracted precisely. New stop words that appear in the Chinese email content are added to the stop-word dictionary.
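The stop-word filtering and dictionary update of step (2) might be sketched as follows. The sketch assumes the mails have already been segmented into tokens (the ICTCLAS plug-in is not called here). The text does not say how new stop words are recognised, so the document-frequency rule in `update_stop_words`, its `max_doc_ratio` parameter and both function names are purely hypothetical.

```python
def filter_stop_words(tokens, stop_words):
    """Return the tokens that are not in the stop-word dictionary."""
    return [t for t in tokens if t not in stop_words]

def update_stop_words(segmented_mails, stop_words, max_doc_ratio=0.9):
    """Add words occurring in more than max_doc_ratio of the mails.

    This threshold rule is a hypothetical stand-in; the source does not
    specify how new stop words are detected. Returns the added words.
    """
    n = len(segmented_mails)
    doc_freq = {}
    for tokens in segmented_mails:
        for t in set(tokens):
            doc_freq[t] = doc_freq.get(t, 0) + 1
    new_words = {
        t for t, df in doc_freq.items()
        if n > 0 and df / n > max_doc_ratio and t not in stop_words
    }
    stop_words |= new_words  # update the dictionary in place
    return new_words

stop_words = {"的", "了", "在"}
mails = [
    ["您", "的", "发票", "已", "开具"],
    ["您", "在", "本店", "中奖", "了"],
    ["您", "的", "会议", "通知"],
]
added = update_stop_words(mails, stop_words)
features = [filter_stop_words(m, stop_words) for m in mails]
```

Updating the dictionary before filtering means a newly detected stop word is removed from the same batch of mails that revealed it.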
When TF-IDF Chinese word segmentation is performed on the Chinese email text set, the constructed stop-word dictionary data and the training text set data must first be entered into the TF-IDF Chinese word segmentation module; see the TF-IDF Chinese email word segmentation flow chart shown in Fig. 2. In the TF-IDF Chinese word segmentation module, the ICTCLAS Chinese word segmentation plug-in of the Chinese Academy of Sciences and the established stop-word dictionary are called to filter out useless words, such as function words and prepositions, from the words obtained by segmenting the mail content. The keywords that remain after this filtering are the feature words for spam detection, and the TF-IDF algorithm then computes the weight each feature word carries in spam detection. The main process by which the TF-IDF Chinese word segmentation algorithm computes feature-word weights is as follows:
Assume that the text set built from the received mail is D and that t is a feature word in the mail content. After Chinese word segmentation, the first quantity computed is the frequency of the feature word in a given mail text:
[formula not reproduced in this text]
where TF(t, D) is the frequency of feature word t in text set D, and AvgTF(D) is the average frequency of all feature words in the text set.
In formula (2), the more documents of the text set a particular feature word occurs in, the larger the denominator becomes, and hence the smaller the resulting IDF (inverse document frequency).
Weight_TF-IDF = TF(t, D) × INFT(t, D)   (formula 4)
Assume that N is the total number of documents in the document set, t is a feature word occurring in a document, and n is the total number of feature words of class i. The normalized TF-IDF formula is then:
[formula not reproduced in this text]
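Because the formulas themselves are not legible in this text, the sketch below falls back on the standard TF-IDF definition (relative term frequency multiplied by log(N / document frequency)) to show how feature-word weights could be computed. It is not the AvgTF-based or normalized variant referred to above; `tf_idf_weights` and the sample mails are hypothetical.

```python
import math
from collections import Counter

def tf_idf_weights(segmented_mails):
    """Return one {word: weight} dict per mail.

    Uses tf = count / mail length and idf = ln(N / df). This is the
    textbook form, assumed here in place of the patent's own formulas.
    """
    n_docs = len(segmented_mails)
    doc_freq = Counter()
    for tokens in segmented_mails:
        doc_freq.update(set(tokens))  # number of mails containing each word
    weights = []
    for tokens in segmented_mails:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in counts.items()
        })
    return weights

# Hypothetical mails, already segmented and stop-word filtered.
mails = [
    ["免费", "发票", "免费"],
    ["会议", "发票"],
    ["会议", "通知"],
]
weights = tf_idf_weights(mails)
```

In the first mail, the word that is both repeated and unique to that mail receives a larger weight than the word shared with another mail, which is the behaviour the IDF discussion above describes.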
Step (3): Extract feature words from the spam and legitimate mail in the Chinese email training text set with the TF-IDF Chinese word segmentation algorithm, and update the feature-word dictionary with the extracted feature words and their weights.
In step (3), during feature-word extraction for a Chinese email, the feature words extracted from the mail content by TF-IDF Chinese word segmentation, together with their computed weights, are compared with the feature words contained in the feature-word dictionary. If an identical feature word exists, its weight in the dictionary is updated; if not, the new feature word and its weight are added to the feature-word dictionary.
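The compare-then-update logic of step (3) can be sketched with a plain dictionary. How an existing weight is "updated" is not specified in the text; overwriting it with the newly computed weight is an assumption of this sketch, and `update_feature_dictionary` is a hypothetical name.

```python
def update_feature_dictionary(dictionary, extracted):
    """Merge extracted {word: weight} pairs into the feature-word dictionary.

    Known words have their weight overwritten (an assumed update rule);
    unknown words are added. Returns (updated_words, added_words).
    """
    updated, added = [], []
    for word, weight in extracted.items():
        if word in dictionary:
            updated.append(word)
        else:
            added.append(word)
        dictionary[word] = weight
    return updated, added

feature_dictionary = {"免费": 0.5, "发票": 0.2}
extracted = {"免费": 0.7, "中奖": 0.9}
updated, added = update_feature_dictionary(feature_dictionary, extracted)
```

Other update rules, such as a running average of old and new weights, would fit the same interface.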
Step (4): Input the feature words and feature-word weights obtained after TF-IDF Chinese word segmentation into the Bayesian filter.
In step (4), the feature words and feature-word weights produced by TF-IDF Chinese word segmentation of the Chinese email training set or of a new mail are input into the Bayes classifier. The probability that the email is spam is calculated from the input feature words and the established feature-word dictionary. When this probability exceeds a set threshold, the email is judged to be spam; otherwise it is judged to be legitimate mail.
Assume that the input mail set is M = {f1, f2, ..., fn} and that its mail classes are C = {good, spam}. The feature words are assumed to be completely independent of one another, and the probability calculation for the feature words can be described as:
[formula not reproduced in this text]
P(F) is the probability of the features occurring in any message, and the product term is the joint probability of the n features, expressed as the product of their individual probabilities. The probability that a mail with the given features is spam equals the probability that any mail is spam, multiplied by the probability of those features occurring together in spam, divided by the probability of observing those features in any message. The computed value alone is not enough to decide whether an email received in the mailbox is spam; a criterion for this value, the so-called threshold, is needed. Comparing the value calculated with formula (6) against the threshold t determines whether the received mail is spam. If P(C = spam | F) > t, that is, if the computed value exceeds the threshold, the received email is judged to be spam. Otherwise the mail does not exceed the threshold, is provisionally judged to be legitimate mail, and is not filtered.
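The verbal description above corresponds to Bayes' theorem under the naive independence assumption. A minimal sketch follows, assuming per-word likelihoods are available as (P(word | spam), P(word | good)) pairs strictly between 0 and 1. The function names, the default prior of 0.5 and the threshold of 0.9 are illustrative assumptions, and the feature-word weights are not used here because the text does not say how they enter the calculation.

```python
import math

def spam_posterior(features, word_probs, prior_spam=0.5):
    """Return P(spam | features) under the naive independence assumption.

    word_probs maps word -> (P(word | spam), P(word | good)), each in (0, 1).
    Words missing from word_probs are skipped. Log space avoids underflow.
    """
    log_spam = math.log(prior_spam)
    log_good = math.log(1.0 - prior_spam)
    for w in features:
        if w in word_probs:
            p_spam, p_good = word_probs[w]
            log_spam += math.log(p_spam)
            log_good += math.log(p_good)
    # P(F) = P(F | spam) P(spam) + P(F | good) P(good)
    m = max(log_spam, log_good)
    numerator = math.exp(log_spam - m)
    return numerator / (numerator + math.exp(log_good - m))

def is_spam(features, word_probs, threshold=0.9, prior_spam=0.5):
    """Judge a mail as spam when the posterior exceeds the threshold t."""
    return spam_posterior(features, word_probs, prior_spam) > threshold

# Hypothetical likelihoods for three feature words.
word_probs = {"免费": (0.75, 0.25), "中奖": (0.8, 0.1), "会议": (0.25, 0.75)}
p = spam_posterior(["免费", "中奖"], word_probs)
```

A higher threshold trades missed spam for fewer legitimate mails wrongly filtered, which is the false-positive concern the method is meant to address.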
Step (5): The Bayes classifier judges whether a mail is spam according to the feature words and feature-word weights in the content of the input Chinese email, and feeds the result back to the log database.
In step (5), to reduce the influence of noise feature words on classification accuracy, conditional feedback is set up after the Bayes classifier has classified an email: the content and classification result of the email are fed back to the log database, and the log database is later used as a sample training set for training.
When classifying mail samples, the Bayes classifier computes posterior probabilities from the prior probabilities obtained by counting, and then decides the class of the mail. See Fig. 3, the feedback process diagram of the Bayesian spam filtering method based on TF-IDF Chinese word segmentation. After the Bayes classifier has classified a mail, a feedback mechanism can effectively handle the misclassification caused by noise feature words. A new mail training text set can be built periodically from the mail contents and classification results fed back to the log database. The newly built text set effectively remedies a defect of traditional naive Bayes spam filtering methods, in which constantly changing feature combinations cause a high classification error rate.
The basic steps of the Bayesian spam filtering method based on TF-IDF Chinese word segmentation have been described in detail above. By applying a TF-IDF Chinese word segmentation algorithm to Bayesian spam filtering, the method aims to segment Chinese email content accurately, and thereby to solve the high error rate that Chinese word segmentation causes in Bayesian spam filtering. In addition, a feedback mechanism is added to the Bayes classifier's spam classification process, which effectively addresses the failure or poor accuracy of spam filtering caused by constantly changing features.
Claims (5)
1. A Bayesian spam filtering method based on TF-IDF Chinese word segmentation, characterized in that it mainly comprises the following steps:
(1) collecting a Chinese email training sample set, including spam and legitimate mail, and building a Chinese email training text set;
(2) performing TF-IDF Chinese word segmentation on the Chinese email training text set according to a stop-word dictionary, and updating the stop-word dictionary;
(3) extracting feature words from the spam and legitimate mail in the Chinese email training text set with the TF-IDF Chinese word segmentation algorithm, and updating a feature-word dictionary with the extracted feature words and their weights;
(4) inputting the feature words and feature-word weights obtained after TF-IDF Chinese word segmentation into a Bayesian filter;
(5) judging, by a Bayes classifier, whether a mail is spam according to the feature words and feature-word weights in the content of the input Chinese email, and feeding the result back to a log database.
2. The Bayesian spam filtering method based on TF-IDF Chinese word segmentation according to claim 1, characterized in that: in step (2), while the Chinese email content is being segmented, the ICTCLAS Chinese word segmentation plug-in of the Chinese Academy of Sciences and the stop-word dictionary are called to filter out the stop words in the Chinese email content, so that the feature words of the content are extracted precisely, and new stop words that appear in the Chinese email content are added to the stop-word dictionary.
3. The Bayesian spam filtering method based on TF-IDF Chinese word segmentation according to claim 1 and claim 2, characterized in that: in step (3), during feature-word extraction for a Chinese email, the feature words extracted from the mail content by TF-IDF Chinese word segmentation, together with their computed weights, are compared with the feature words contained in the feature-word dictionary; if an identical feature word exists, its weight in the dictionary is updated; if not, the new feature word and its weight are added to the feature-word dictionary.
4. The Bayesian spam filtering method based on TF-IDF Chinese word segmentation according to claim 1, characterized in that: in step (4), the feature words and feature-word weights produced by TF-IDF Chinese word segmentation of the Chinese email training set or of a new mail are input into the Bayes classifier; the probability that the email is spam is calculated from the input feature words and the established feature-word dictionary; when this probability exceeds a set threshold, the email is judged to be spam, and otherwise it is judged to be legitimate mail.
5. The Bayesian spam filtering method based on TF-IDF Chinese word segmentation according to claim 1, characterized in that: in step (5), to reduce the influence of noise feature words on classification accuracy, conditional feedback is set up after the Bayes classifier has classified an email, the content and classification result of the email are fed back to the log database, and the log database is later used as a sample training set for training.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710257123.7A CN107086952A (en) | 2017-04-19 | 2017-04-19 | A kind of Bayesian SPAM Filtering method based on TF IDF Chinese word segmentations |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107086952A (en) | 2017-08-22
Family ID: 59612833
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710257123.7A (Pending), CN107086952A (en) | A kind of Bayesian SPAM Filtering method based on TF IDF Chinese word segmentations | 2017-04-19 | 2017-04-19
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107086952A (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108199953A (en) * | 2018-01-31 | 2018-06-22 | 湖北工业大学 | A kind of spam filtering method and system |
CN108287860A (en) * | 2017-09-05 | 2018-07-17 | 腾讯科技(深圳)有限公司 | Model generating method, garbage files recognition methods and device |
CN108427775A (en) * | 2018-06-04 | 2018-08-21 | 成都市大匠通科技有限公司 | A kind of project cost inventory sorting technique based on multinomial Bayes |
CN108491390A (en) * | 2018-03-28 | 2018-09-04 | 江苏满运软件科技有限公司 | A kind of main line logistics goods title automatic recognition classification method |
CN108804651A (en) * | 2018-06-07 | 2018-11-13 | 南京邮电大学 | A kind of Social behaviors detection method based on reinforcing Bayes's classification |
CN108830108A (en) * | 2018-06-04 | 2018-11-16 | 成都知道创宇信息技术有限公司 | A kind of web page contents altering detecting method based on NB Algorithm |
CN108985721A (en) * | 2018-07-12 | 2018-12-11 | 燕山大学 | A kind of process for sorting mailings and system |
CN109191354A (en) * | 2018-08-21 | 2019-01-11 | 安徽讯飞智能科技有限公司 | A kind of whole people society pipe task distribution method based on natural language processing |
CN110149268A (en) * | 2019-05-15 | 2019-08-20 | 深圳市趣创科技有限公司 | A kind of method and its system of automatic fitration spam |
CN110300054A (en) * | 2019-07-03 | 2019-10-01 | 论客科技(广州)有限公司 | The recognition methods of malice fishing mail and device |
CN110505144A (en) * | 2019-08-09 | 2019-11-26 | 世纪龙信息网络有限责任公司 | Process for sorting mailings, device, equipment and storage medium |
CN111079427A (en) * | 2019-12-20 | 2020-04-28 | 北京金睛云华科技有限公司 | Junk mail identification method and system |
CN111651598A (en) * | 2020-05-28 | 2020-09-11 | 上海勃池信息技术有限公司 | Spam text auditing device and method through center vector similarity matching |
CN112215002A (en) * | 2020-11-02 | 2021-01-12 | 浙江大学 | Electric power system text data classification method based on improved naive Bayes |
CN112699242A (en) * | 2021-01-11 | 2021-04-23 | 大连东软信息学院 | Method for identifying Chinese text author |
CN116016416A (en) * | 2023-03-24 | 2023-04-25 | 深圳市明源云科技有限公司 | Junk mail identification method, device, equipment and computer readable storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1889108A (en) * | 2005-06-29 | 2007-01-03 | 腾讯科技(深圳)有限公司 | Method of identifying junk mail |
US20080301809A1 (en) * | 2007-05-31 | 2008-12-04 | Nortel Networks | System and method for detectng malicious mail from spam zombies |
CN101996241A (en) * | 2010-10-22 | 2011-03-30 | 东南大学 | Bayesian algorithm-based content filtering method |
CN103744905A (en) * | 2013-12-25 | 2014-04-23 | 新浪网技术(中国)有限公司 | Junk mail judgment method and device |
CN104731772A (en) * | 2015-04-14 | 2015-06-24 | 辽宁大学 | Improved feature evaluation function based Bayesian spam filtering method |
-
2017
- 2017-04-19 CN CN201710257123.7A patent/CN107086952A/en active Pending
Non-Patent Citations (1)
Title |
---|
陈琦等: "基于TF*IDF的垃圾邮件过滤特征选择改进算法", 《计算机应用研究》 * |
Cited By (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108287860A (en) * | 2017-09-05 | 2018-07-17 | 腾讯科技(深圳)有限公司 | Model generating method, garbage files recognition methods and device |
CN108199953B (en) * | 2018-01-31 | 2020-09-29 | 湖北工业大学 | Junk mail identification method and system |
CN108199953A (en) * | 2018-01-31 | 2018-06-22 | 湖北工业大学 | A kind of spam filtering method and system |
CN108491390A (en) * | 2018-03-28 | 2018-09-04 | 江苏满运软件科技有限公司 | A kind of main line logistics goods title automatic recognition classification method |
CN108427775A (en) * | 2018-06-04 | 2018-08-21 | 成都市大匠通科技有限公司 | A kind of project cost inventory sorting technique based on multinomial Bayes |
CN108830108A (en) * | 2018-06-04 | 2018-11-16 | 成都知道创宇信息技术有限公司 | A kind of web page contents altering detecting method based on NB Algorithm |
CN108804651A (en) * | 2018-06-07 | 2018-11-13 | 南京邮电大学 | A kind of Social behaviors detection method based on reinforcing Bayes's classification |
CN108804651B (en) * | 2018-06-07 | 2022-08-19 | 南京邮电大学 | Social behavior detection method based on enhanced Bayesian classification |
CN108985721A (en) * | 2018-07-12 | 2018-12-11 | 燕山大学 | A kind of process for sorting mailings and system |
CN108985721B (en) * | 2018-07-12 | 2020-10-02 | 燕山大学 | Mail classification method and system |
CN109191354A (en) * | 2018-08-21 | 2019-01-11 | 安徽讯飞智能科技有限公司 | A kind of whole people society pipe task distribution method based on natural language processing |
CN110149268A (en) * | 2019-05-15 | 2019-08-20 | 深圳市趣创科技有限公司 | A kind of method and its system of automatic fitration spam |
CN110300054A (en) * | 2019-07-03 | 2019-10-01 | 论客科技(广州)有限公司 | The recognition methods of malice fishing mail and device |
CN110505144A (en) * | 2019-08-09 | 2019-11-26 | 世纪龙信息网络有限责任公司 | Process for sorting mailings, device, equipment and storage medium |
CN111079427A (en) * | 2019-12-20 | 2020-04-28 | 北京金睛云华科技有限公司 | Junk mail identification method and system |
CN111651598A (en) * | 2020-05-28 | 2020-09-11 | 上海勃池信息技术有限公司 | Spam text auditing device and method through center vector similarity matching |
CN112215002A (en) * | 2020-11-02 | 2021-01-12 | 浙江大学 | Electric power system text data classification method based on improved naive Bayes |
CN112699242A (en) * | 2021-01-11 | 2021-04-23 | 大连东软信息学院 | Method for identifying Chinese text author |
CN116016416A (en) * | 2023-03-24 | 2023-04-25 | 深圳市明源云科技有限公司 | Junk mail identification method, device, equipment and computer readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107086952A (en) | A kind of Bayesian SPAM Filtering method based on TF IDF Chinese word segmentations | |
CN106453033B (en) | Multi-level process for sorting mailings based on Mail Contents | |
CN103514174B (en) | A kind of file classification method and device | |
CN104463552B (en) | Calendar reminding generation method and device | |
Faguo et al. | Research on short text classification algorithm based on statistics and rules | |
CN103024746A (en) | System and method for processing spam short messages for telecommunication operator | |
CN103984703B (en) | Mail classification method and device | |
CN104050556B (en) | The feature selection approach and its detection method of a kind of spam | |
Christina et al. | Email spam filtering using supervised machine learning techniques | |
CN101908055B (en) | Method for setting information classification threshold for optimizing lam percentage and information filtering system using same | |
CN110781679B (en) | News event keyword mining method based on associated semantic chain network | |
CN105843851A (en) | Analyzing and extracting method and device of cheating mails | |
CN101295381A (en) | Junk mail detecting method | |
CN104731772B (en) | Improved feature evaluation function based Bayesian spam filtering method | |
CN110149268A (en) | A kind of method and its system of automatic fitration spam | |
CN102945246A (en) | Method and device for processing network information data | |
CN107544961A (en) | A kind of sentiment analysis method, equipment and its storage device of social media comment | |
Deng et al. | Research on a naive bayesian based short message filtering system | |
CN105337842B (en) | A kind of rubbish mail filtering method unrelated with content | |
CN105117466A (en) | Internet information screening system and method | |
Anitha et al. | Email spam filtering using machine learning based XGBoost classifier method | |
CN106230690B (en) | A kind of process for sorting mailings and system of combination user property | |
CN101329668A (en) | Method and apparatus for generating information regulation and method and system for judging information types | |
Mirza et al. | Evaluating efficiency of classifier for email spam detector using hybrid feature selection approaches | |
CN106294542B (en) | A kind of letters and calls data mining methods of marking and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication |
 | SE01 | Entry into force of request for substantive examination |
 | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170822