CN101784022A - Method and system for filtering and classifying short messages - Google Patents

Info

Publication number
CN101784022A
CN101784022A CN200910077123A CN200910077123A CN101784022A CN 101784022 A CN101784022 A CN 101784022A CN 200910077123 A CN200910077123 A CN 200910077123A CN 200910077123 A CN200910077123 A CN 200910077123A CN 101784022 A CN101784022 A CN 101784022A
Authority
CN
China
Prior art keywords
short message
filtering
regular expression
messages
spam messages
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200910077123A
Other languages
Chinese (zh)
Inventor
柳呈文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING YANHUANG XINXING NETWORK SCI-TECH Co Ltd
Original Assignee
BEIJING YANHUANG XINXING NETWORK SCI-TECH Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING YANHUANG XINXING NETWORK SCI-TECH Co Ltd
Priority to CN200910077123A
Publication of CN101784022A
Legal status: Pending

Abstract

Building on traditional short message filtering, the invention provides a spam short message filtering method based on sending-volume features and message-content features, combining a Chinese character regular expression with an improved Bayesian algorithm. The method improves the accuracy of spam identification while reducing both the false positive rate and the false negative rate, and it classifies the identified spam a second time so that users can configure personalized settings. The method comprises the following steps: (1) preprocessing the short message text; (2) sending-volume matching: matching the sent content and the number of messages sent; (3) performing lexical word segmentation using the Chinese character regular expression and a dictionary-plus-part-of-speech method; (4) classifying with a spam short message classifier: applying the short message feature rules defined by the Chinese character regular expression and calculating probabilities with the improved Bayesian algorithm to distinguish spam from non-spam short messages; and (5) using a short message type affiliation classifier to classify and process the identified spam short messages.

Description

Method and system for filtering and classifying short messages
Technical field:
The present invention relates to the interception of spam short messages, and in particular to a method and system by which a telecom operator's short message center filters short messages and classifies them a second time.
Background art:
Short messages (SMS) have become a very important form of communication in China. Yet while we enjoy the convenience at our fingertips, we also face constant harassment from "spam short messages". Spam short messages are not merely a nuisance; more seriously, they have become a tool for criminals to distribute and spread illegal information.
The short message filtering methods and mechanisms in common use today mainly include keyword-based filtering, content-based filtering, and filtering based on analysis of sending volume and sender. Most of them reuse general-purpose junk-information techniques such as Bayes, SVM, and artificial neural network algorithms, and each has drawbacks when used on its own. For example, keyword filtering has high false positive and false negative rates: for a message such as "Company so-and-so provides such-and-such service on a long-term basis", filtering on single keywords such as "company", "long-term", "provide", and "service" leads to either a high misidentification rate or a high miss rate. A filtering mechanism based on the sending frequency of a single calling number can be evaded by sending in batches from multiple numbers. Moreover, the filtering functions in common use treat all spam short messages alike and cannot be customized per user. For example, if a user wants to receive "real estate" short messages, those messages should not be treated as spam for that user. How to combine multiple filtering algorithms and mechanisms so as to keep the false positive and false negative rates low, let users conveniently subscribe to the information they want, and genuinely guard against the flood of spam short messages is a problem that urgently needs to be solved.
Summary of the invention:
To overcome the above deficiencies, the present invention builds on traditional spam short message filtering and proposes a filtering method based on sending-volume features and spam content features, combined with a "Chinese character regular expression" and an "improved Bayesian algorithm". It improves the accuracy of spam identification while reducing the false positive rate and the false negative rate. The invention also classifies spam short messages a second time, so that users can configure personal settings and selectively block junk information.
To achieve the above object, the spam short message filtering method of the present invention comprises the following steps:
Step 1: preprocess the short message text (keyword processing and blacklist/whitelist processing).
Step 2: sending-volume matching: match the sent content and the number of messages sent.
Step 3: perform lexical word segmentation using the "Chinese character regular expression" and the "dictionary plus part of speech" method.
Step 4: classify with the spam short message classifier: apply the short message feature rules defined by the "Chinese character regular expression" and calculate probabilities with the improved Bayesian algorithm to identify spam and non-spam short messages.
Step 5: classify with the short message type affiliation classifier, processing the identified spam short messages by category.
Sending-volume matching in step 2 means comparing and matching the target short message against the content of short messages sent within a certain period, and calculating a corresponding weight value as a parameter for further calculation.
The short message feature rules defined by the "Chinese character regular expression" in step 4 are rules, based on short message text length, telephone numbers, addresses, URLs (organizations), the ratio of digits and symbols, and phrase probabilities, that serve as the strategy for judging whether a message is spam.
The improved Bayesian algorithm in step 4 builds on the traditional Bayesian algorithm by additionally incorporating the degree of correlation of each feature attribute into the algorithm as a weight.
The short message type affiliation classifier in step 5 performs secondary classification on messages that have been judged to be spam.
By combining the above algorithms and mechanisms, the present invention brings together the advantages of each method. While filtering spam short messages effectively, it adopts category-based customization so that short messages a user wants are exempt from filtering, making it a more user-friendly, systematic method for spam short message filtering.
Description of drawings:
Fig. 1 is a workflow diagram of the short message filtering and classification system provided by the invention.
Fig. 2 is a schematic flow chart of the short message filtering and classification method provided by the invention.
Fig. 3 is a flow chart of the short message filtering and secondary classification provided by the invention.
Embodiment:
The steps of the short message filtering and secondary classification method provided by the invention are as follows:
Step 1: preprocess the short message text (keyword processing and blacklist/whitelist processing).
Before word segmentation, the short message content must first be preprocessed, including deletion, normalization, and marking. Preprocessing acts as a form of semantic segmentation and improves segmentation accuracy; it also marks key features of spam content, laying the foundation for subsequent analysis.
First, invalid parts of the short message content are deleted or marked, which reduces interference and improves the efficiency of subsequent processing.
Next, the content is converted to a uniform form: for example, full-width digits and symbols are converted to standard half-width ones, and special variants in the content are recognized, such as "O" standing for "0" and "I" standing for "1".
Important markers, such as telephone numbers, addresses, organization names, personal names, URLs, and e-mail addresses, are extracted and tagged as key content features of spam short messages.
Using the "Chinese character regular expression" in the preprocessing algorithm makes the handling of punctuation, English letters, digits, and so on more flexible, and also makes it easy to add new rules as short message content changes.
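The normalization part of this preprocessing can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `normalize_sms` and the choice of look-alike letters (`O`, `o`, `I`, `l`) are assumptions, and look-alikes are only rewritten when they sit in a run that already contains a real digit, so ordinary words are left alone.

```python
import re
import unicodedata

# Assumed look-alike letters that spammers substitute for digits.
_LOOKALIKE = str.maketrans("OoIl", "0011")
# A run of two or more digits and/or look-alike letters.
_RUN = re.compile(r"[0-9OoIl]{2,}")

def normalize_sms(text):
    """Convert full-width forms to half-width and undo digit look-alikes."""
    # NFKC maps full-width digits, letters and symbols to their ASCII forms.
    text = unicodedata.normalize("NFKC", text)

    def fix(match):
        run = match.group()
        # Only rewrite runs that contain at least one genuine digit.
        if any(ch.isdigit() for ch in run):
            return run.translate(_LOOKALIKE)
        return run

    return _RUN.sub(fix, text)
```

For example, `normalize_sms("电话１３８O0I３")` yields `"电话1380013"`, while a word such as `"Hello"` passes through unchanged.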
The system applies blacklists/whitelists and keyword filtering at two levels: the system level uses the blacklist, whitelist, and keywords provided uniformly by the system, while at the user level each user can configure them according to their own needs.
Step 2: sending-volume matching: match the sent content and the number of messages sent.
Sending volume is an important feature for judging whether a message is spam. Both the number of short messages sent by the same sending number per unit time and the number of short messages with the same content per unit time are important bases for the judgment.
The sending-volume monitoring module connects to the mobile operator's short message center, obtains the sending volume of all mobile numbers in real time, and records the number of short messages with identical content sent to different users within a time period.
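A sliding-window counter is one simple way to keep the two counts described above. The sketch below is an assumption-laden illustration: the class name `VolumeMonitor`, the one-hour default window, and the use of a SHA-1 digest to group identical content are all hypothetical choices, and matching of merely similar content is not covered.

```python
import hashlib
from collections import defaultdict, deque

class VolumeMonitor:
    """Track per-sender and per-content sending volume in a time window."""

    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.by_sender = defaultdict(deque)   # sender -> (time, recipient)
        self.by_content = defaultdict(deque)  # content digest -> (time, recipient)

    def _push(self, queue, now, recipient):
        queue.append((now, recipient))
        # Drop entries that have fallen out of the window.
        while queue and queue[0][0] <= now - self.window:
            queue.popleft()
        return queue

    def record(self, sender, recipient, text, now):
        """Return (messages from this sender, distinct recipients of this text)."""
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        sent = self._push(self.by_sender[sender], now, recipient)
        same = self._push(self.by_content[digest], now, recipient)
        return len(sent), len({r for _, r in same})
```

Because the content count is keyed on the text rather than the sender, batches sent from several numbers still accumulate under the same digest, which addresses the multi-number evasion mentioned in the background section.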
Step 3: perform lexical word segmentation using the "Chinese character regular expression" and the "dictionary plus part of speech" method.
Lexical word segmentation stores the dictionary in memory using a hash index, which effectively improves segmentation efficiency. A two-level index dictionary is built, consisting of a hash index on the first character, a hash index on the second character, and a group of remaining character strings.
The segmentation algorithm is reverse maximum matching: segmentation starts from the end of the sentence and proceeds backward, combining maximum-phrase-length matching with structural analysis of keyword sentences. A span no longer than the maximum phrase length is cut from the text and matched against the dictionary; if the span is a word, it is extracted, and the remaining text is segmented in the same way.
After segmentation the text becomes a sequence of phrases. Many phrases have more than one part of speech and may also have more than one meaning; their part of speech and meaning can only be determined from the specific context. Grammatical (part-of-speech) segmentation handles this well: the part-of-speech order of each sentence is analyzed, and common sentence patterns are obtained through model training. A trained Markov chain can perform part-of-speech segmentation effectively. It is a discrete-time random process with the Markov property: given the present knowledge or information, the past (the history before the current state) is irrelevant to predicting the future (the states after the current one).
Advantages of using a Markov chain:
A. Part-of-speech tagging fits an automatic tagging model, and an approximate approach is acceptable.
B. It provides a strong theoretical framework and a direct, effective means of resolving ambiguity.
C. The required model parameters can be estimated from known data, that is, obtained through training.
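The text does not say how the trained Markov chain is decoded. One common choice, used here purely as an assumption, is a first-order hidden Markov model decoded with the Viterbi algorithm. The function name `viterbi`, the probability tables, and the `floor` value for unseen events are all hypothetical.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for a non-empty word list."""
    def lp(table, key):
        # Log-probability with a small floor for unseen events.
        return math.log(table.get(key, floor))

    # Each column maps tag -> (best log-score so far, previous tag).
    columns = [{t: (lp(start_p, t) + lp(emit_p[t], words[0]), None) for t in tags}]
    for word in words[1:]:
        prev = columns[-1]
        col = {}
        for t in tags:
            score, back = max((prev[p][0] + lp(trans_p[p], t), p) for p in tags)
            col[t] = (score + lp(emit_p[t], word), back)
        columns.append(col)

    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: columns[-1][t][0])
    path = [tag]
    for col in reversed(columns[1:]):
        tag = col[tag][1]
        path.append(tag)
    path.reverse()
    return path
```

Because each step depends only on the previous tag, the cost is linear in sentence length, and all tables can be estimated by counting in a tagged corpus, which matches advantage C above.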
In a short message filtering system, segmentation performance strongly affects classification performance, so two measures of segmentation performance are computed during testing:
Sentence level: performance = number of correctly segmented sentences / total number of sentences × 100%
Word level: performance = number of correctly segmented words / total number of words × 100%
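The two measures can be computed as in the sketch below. The text does not define when a word counts as correct; this illustration assumes a predicted word is correct when its character span coincides with a word in the reference segmentation, and takes the reference word count as the total. The names `spans` and `segmentation_scores` are hypothetical.

```python
def spans(words):
    """Return the set of (start, end) character spans for a segmentation."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out

def segmentation_scores(predicted, gold):
    """Return (sentence-level %, word-level %) for parallel lists of segmentations."""
    sentences_ok = sum(p == g for p, g in zip(predicted, gold))
    words_ok = sum(len(spans(p) & spans(g)) for p, g in zip(predicted, gold))
    words_total = sum(len(g) for g in gold)
    return 100.0 * sentences_ok / len(gold), 100.0 * words_ok / words_total
```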
Step 4: classify with the spam short message classifier: apply the short message feature rules defined by the "Chinese character regular expression" and calculate probabilities with the improved Bayesian algorithm to identify spam and non-spam short messages.
The spam short message classifier uses an inductive approach: it extracts spam features and classifies by combining them with probability statistics. Short messages differ from spam e-mail in length and information content. Through long-term study, data mining, feature extraction, and testing, the main features of spam short messages were found to be: the length feature; the telephone number feature; address, URL, and organization features; the sending-volume feature; word probability features; and so on.
Length feature:
The relative length of spam short messages is approximately normally distributed. Statistical methods are used to compile the distribution of the number of Chinese characters in normal and spam short messages.
Statistical rules:
A. Short messages without Chinese characters are not counted.
B. Statistics are computed on short messages after similar samples have been removed.
Test results:
A. Spam short messages are generally longer.
B. Normal short messages are generally shorter.
C. Among extra-long short messages, normal messages account for a slightly larger share.
Telephone number feature:
A large share of spam short messages are advertising-type, and most leave contact details such as a telephone number. A telephone number can be written in several ways: all half-width (English) digits, all full-width (Chinese) digits, or a mixture; its length falls within a certain range.
To evade text filters, spam senders often use the mixed form, and often substitute the character "O" for the digit "0", "I" for "1", and so on. The Chinese character regular expression effectively solves the recognition of telephone numbers written in these mixed forms.
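A hedged sketch of such a telephone-number recognizer follows. The patent does not give its telephone pattern, so the regular expression, the 7 to 13 digit length range, the set of look-alike letters, and the name `find_phone_numbers` are all assumptions made for illustration.

```python
import re
import unicodedata

# Assumed look-alike substitutions: O/o for 0, I/l for 1.
_LOOKALIKE = str.maketrans("OoIl", "0011")
# Digits or look-alikes, optionally separated by hyphens or spaces.
_CANDIDATE = re.compile(r"[0-9OoIl][0-9OoIl\- ]{5,18}[0-9OoIl]")

def find_phone_numbers(text, min_digits=7, max_digits=13):
    """Return normalized digit strings that look like telephone numbers."""
    # NFKC turns full-width digits and letters into half-width ones.
    text = unicodedata.normalize("NFKC", text)
    found = []
    for match in _CANDIDATE.finditer(text):
        raw = match.group()
        digits = re.sub(r"[\- ]", "", raw).translate(_LOOKALIKE)
        real = sum(ch.isdigit() for ch in raw)
        # Require mostly genuine digits so plain words are not mistaken for numbers.
        if min_digits <= len(digits) <= max_digits and real * 2 >= len(digits):
            found.append(digits)
    return found
```

A call such as `find_phone_numbers("电话１３８O013800O")` returns `["13800138000"]`, covering both the full-width digits and the letter substitutions described above.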
Address, URL, and organization features:
Besides telephone numbers, spam short messages often contain other contact details, such as addresses, places, website names, URLs, and organization names. Chinese character regular expression patterns can match such content successfully.
Common URL form: xxx.xxx.xxx, where xxx stands for English letters or digits.
IP form: nnn.nnn.nnn.nnn, where nnn stands for a number between 0 and 255.
The Chinese character regular expressions are as follows:
{(http|HTTP)://[a-zA-Z0-9/_-]+(\.[a-zA-Z0-9/_-]+)+}
{(WWW|www)\.[a-zA-Z0-9_-]+(\.[a-zA-Z0-9/_-]+)+}
...
Common e-mail address form: xxx@xxx.xxx, where xxx stands for English letters, digits, or certain symbols.
The Chinese character regular expression is as follows:
{[a-zA-Z0-9\._-]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+}
...
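The URL and e-mail patterns above carry over to Python's `re` module almost unchanged once the surrounding braces are dropped. The sketch below does that and adds an IPv4 pattern with a 0 to 255 range check, which is an assumption based on the description of the IP form; the names `URL`, `WWW`, `EMAIL`, `IPV4`, and `contact_features` are hypothetical.

```python
import re

URL = re.compile(r"(?:http|HTTP)://[a-zA-Z0-9/_-]+(?:\.[a-zA-Z0-9/_-]+)+")
WWW = re.compile(r"(?:WWW|www)\.[a-zA-Z0-9_-]+(?:\.[a-zA-Z0-9/_-]+)+")
EMAIL = re.compile(r"[a-zA-Z0-9._-]+@[a-zA-Z0-9_-]+(?:\.[a-zA-Z0-9_-]+)+")
# Four dot-separated groups of 1-3 digits, not embedded in a longer number.
IPV4 = re.compile(r"(?<![0-9.])(?:[0-9]{1,3}\.){3}[0-9]{1,3}(?![0-9.])")

def contact_features(text):
    """Extract URL, e-mail and IP address features from a short message."""
    ips = [m for m in IPV4.findall(text)
           if all(int(part) <= 255 for part in m.split("."))]
    return {
        "url": URL.findall(text) + WWW.findall(text),
        "email": EMAIL.findall(text),
        "ip": ips,
    }
```

Note that a URL of the form `http://www.…` is matched by both `URL` and `WWW`, so a production version would need to deduplicate overlapping matches.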
Addresses take many forms, such as "XXX Street" or "No. NNN, XXX Road".
The Chinese character regular expression is as follows:
[~]+(road | the main road | the street) [0-9]+number
...
Organization names take even more forms, such as XXX Market, XXX Hotel, XXX Company, and so on.
The Chinese character regular expression is as follows:
[~] [~]+(company | group | business department | the shop | the dining room | battalion ...)
...
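The address and organization patterns above appear in translated form, so the Python sketch below relies on several assumptions: that `[~]` denotes "any Chinese character" (written here as the range `\u4e00-\u9fff`), and that the translated alternatives correspond to the literals 路, 大道, 街, 号, 公司, 集团, 营业部, 店, and 餐厅. The truncated tail of the organization list is left out rather than guessed.

```python
import re

# Assumption: "[~]" in the patent text stands for any Chinese character.
HANZI = r"[\u4e00-\u9fff]"

# One or more Chinese characters, a road-type suffix, digits, then 号 ("No.").
ADDRESS = re.compile(HANZI + r"+(?:路|大道|街)[0-9]+号")
# Two or more Chinese characters followed by an organization suffix.
ORGANIZATION = re.compile(HANZI + r"{2,}(?:公司|集团|营业部|店|餐厅)")

def address_and_org_features(text):
    """Return the address and organization strings found in a short message."""
    return {
        "address": ADDRESS.findall(text),
        "organization": ORGANIZATION.findall(text),
    }
```

Because the leading character class is greedy, a match absorbs any Chinese text directly in front of the name; word segmentation from step 3 would be needed to trim it to the name itself.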
Digit and symbol ratio:
The ratio of digits and symbols is one of the important bases for identifying spam short messages. For example, command-type short messages are normal messages, and digits and symbols account for a large proportion of their content.
Statistical rules:
A. Short messages without Chinese characters are not counted (purely numeric short messages are generally normal messages).
B. Statistics are computed on short messages after similar samples have been removed.
C. Each continuous string of digits and letters counts as one unit of measurement, and each Chinese character counts as one unit.
Main conclusions:
A. The proportion of digit-and-letter strings in spam short messages is generally small.
B. Short messages with a larger proportion of digit-and-letter strings are mostly normal messages.
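Rules A and C translate directly into a small counting function. In this sketch the name `alnum_ratio` is hypothetical, punctuation and other symbols are ignored (an assumption, since rule C mentions only digit-letter strings and Chinese characters), and `None` signals a message excluded by rule A.

```python
import re

# One unit = a run of ASCII letters/digits, or a single Chinese character.
_UNIT = re.compile(r"[0-9A-Za-z]+|[\u4e00-\u9fff]")

def alnum_ratio(text):
    """Share of units that are digit-letter strings, or None if no Chinese characters."""
    units = _UNIT.findall(text)
    hanzi = sum(1 for u in units if "\u4e00" <= u[0] <= "\u9fff")
    if hanzi == 0:
        return None  # rule A: messages without Chinese characters are not counted
    return (len(units) - hanzi) / len(units)
```

For instance, `"余额100元"` contains three Chinese characters and one digit string, giving a ratio of 0.25.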
Spam short message classifier:
Short message text is divided into two classes (spam/non-spam). Phrases and other features serve as the distinguishing attributes, and the probability with which each feature occurs in spam and in non-spam is computed. After comparing the spam-filtering performance of several text classifiers, such as Bayes, K-nearest neighbor, and support vector machines, the improved Bayesian algorithm was adopted. Each feature is treated as an attribute with a probability vector over spam/non-spam. When a short message arrives, the probabilities that it belongs to spam and to non-spam are calculated from its segmented words and features.
When training on short message text, stop words must be handled. Stop words are phrases that occur very frequently but carry no meaning, such as interjections and auxiliary words. The benefits of doing so are:
A. It reduces interference from such phrases and improves the accuracy of the classifier.
B. Stop words usually occur very frequently, so removing them both shrinks the training text and reduces the number of word lookups during identification, which greatly improves the efficiency and performance of identification.
Correction of classifier probabilities: if a phrase occurs in only a few spam texts and never in non-spam texts, its spam probability is very high and its non-spam probability is zero. In that case the classifier would label any text containing the phrase as spam, which raises the classifier's false positive rate. The training result must therefore be corrected: when a phrase's probability of occurrence in some class is zero, it is replaced with the minimum probability value in the phrase probability library.
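Stop-word removal and the zero-probability correction can be sketched together. The function name `train_phrase_probs` and the relative-frequency estimate are assumptions; the correction follows the text by replacing zeros with the smallest non-zero probability found in the whole table.

```python
from collections import Counter

def train_phrase_probs(docs_by_class, stop_words=frozenset()):
    """Estimate P(phrase | class) from segmented documents, with zero correction."""
    counts = {
        label: Counter(w for doc in docs for w in doc if w not in stop_words)
        for label, docs in docs_by_class.items()
    }
    vocab = set().union(*counts.values())
    probs = {}
    for label, counter in counts.items():
        total = max(sum(counter.values()), 1)
        probs[label] = {w: counter[w] / total for w in vocab}

    # Smallest non-zero probability anywhere in the phrase probability library.
    floor = min((p for table in probs.values() for p in table.values() if p > 0),
                default=1.0)
    for table in probs.values():
        for word, p in table.items():
            if p == 0:
                table[word] = floor
    return probs
```

After the correction the values in each class no longer sum exactly to one, which is harmless when the classifier only compares scores between classes.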
Computing formula:
Suppose $d_i = \{w_{i1}, w_{i2}, \ldots, w_{in}\}$ is an arbitrary document that belongs to some class $c_j$ of the document class set $C = \{c_1, c_2, \ldots, c_k\}$. According to the Bayes classifier:
$$P(c_j \mid d_i) = \frac{P(d_i \mid c_j)\,P(c_j)}{P(d_i)} \propto P(d_i \mid c_j)\,P(c_j)$$
where:
$$P(d_i \mid c_j) = \prod_{k=1}^{r} P(w_{ik} \mid c_j)$$
Regarding the correlation among Bayesian attributes: text classification often simplifies the algorithm by assuming that all attributes are mutually independent. In the evaluation of spam short message classification, however, the independence assumption greatly reduced classification performance. Taking into account algorithmic complexity and current hardware speed, an algorithm based on pairwise-correlated attributes is adopted, which greatly improves analysis performance.
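The text gives no formula for the pairwise-correlated attribute algorithm, so the sketch below shows only one possible reading of the "improved" classifier: a naive Bayes score computed in log space, with each phrase's log-probability multiplied by a per-phrase weight that stands for its degree of correlation. The function name `classify`, the `weights` table, and the `floor` for unseen phrases are assumptions.

```python
import math

def classify(words, priors, phrase_probs, weights=None, floor=1e-6):
    """Return (best class, scores) using log P(c) + sum of w * log P(word | c)."""
    weights = weights or {}
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for w in words:
            p = max(phrase_probs[label].get(w, 0.0), floor)
            # A weight of 1.0 reduces to the ordinary naive Bayes product.
            score += weights.get(w, 1.0) * math.log(p)
        scores[label] = score
    return max(scores, key=scores.get), scores
```

Working with sums of logarithms rather than the raw product avoids numerical underflow when a message contains many phrases.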
Step 5: classify with the short message type affiliation classifier, processing the identified spam short messages by category.
Because spam and non-spam are defined differently by different people, and spam short messages come in many types, such as real estate promotion, invoice, and education, each person may want to receive short messages of a certain category. A type affiliation classifier must therefore be added after the spam classifier.
The type affiliation classifier performs probability calculation on phrases only. Its processing is similar to that of the spam short message classifier, but it must handle classification into multiple types.
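The secondary classification and the per-user exemption can be sketched as follows. The category labels, the probability tables, and the function names `spam_category` and `should_deliver` are hypothetical; the scoring uses phrase probabilities only, as the text describes.

```python
import math

def spam_category(words, category_probs, floor=1e-6):
    """Pick the spam category whose phrase probabilities best explain the words."""
    def score(category):
        table = category_probs[category]
        return sum(math.log(max(table.get(w, 0.0), floor)) for w in words)
    return max(category_probs, key=score)

def should_deliver(is_spam, words, category_probs, subscribed):
    """Deliver normal messages, and spam only in categories the user subscribed to."""
    if not is_spam:
        return True
    return spam_category(words, category_probs) in subscribed
```

A user who has subscribed to a hypothetical `"real_estate"` category would still receive a message classified into it, while the same message is blocked for users with no subscriptions.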

Claims (9)

1. A method and system for filtering and classifying short messages, the method comprising:
Step 1: preprocess the short message text (keyword processing and blacklist/whitelist processing).
Step 2: sending-volume matching: match the sent content and the number of messages sent.
Step 3: perform lexical word segmentation using the "Chinese character regular expression" and the "dictionary plus part of speech" method.
Step 4: classify with the spam short message classifier: apply the short message feature rules defined by the "Chinese character regular expression" and calculate probabilities with the improved Bayesian algorithm to identify spam and non-spam short messages.
Step 5: classify with the short message type affiliation classifier, processing the identified spam short messages by category.
On the basis of traditional spam short message filtering, the invention proposes a feature-based filtering method that improves the accuracy of spam short message identification while reducing the false positive rate and the false negative rate.
2. The short message filtering and classification method according to claim 1, characterized in that step 1 further comprises: the preprocessing algorithm uses the Chinese character regular expression algorithm, which handles punctuation, English letters, digits, and so on more flexibly.
3. The short message filtering and classification method according to claim 1, characterized in that step 2 further comprises: matching according to identical short message content, similar short message content, and the number of short messages per unit time.
4. The short message filtering and classification method according to claim 1, characterized in that step 3 further comprises: reverse matching against the dictionary, and using a Markov chain for part-of-speech correction.
5. The short message filtering and classification method according to claim 1, characterized in that step 4 further comprises: after word segmentation of the short message text, extracting the attributes of the feature vector.
6. The short message filtering and classification method according to claims 1 and 5, characterized in that step 4 further comprises: a recognition method based on telephone numbers and regular expressions.
7. The short message filtering and classification method according to claims 1, 5, and 6, characterized in that step 4 further comprises: a method of recognizing address and URL (organization) features, and the regular expression content.
8. The short message filtering and classification method according to claims 1, 5, 6, and 7, characterized in that step 4 further comprises: the correction algorithm of the improved Bayesian algorithm.
9. The short message filtering and classification method according to claim 1, characterized in that step 5 further comprises: secondary classification after spam short message filtering, performing type affiliation.
CN200910077123A 2009-01-16 2009-01-16 Method and system for filtering and classifying short messages Pending CN101784022A (en)

Publications (1)

Publication Number Publication Date
CN101784022A true CN101784022A (en) 2010-07-21

Family

ID=42523793

Country Status (1)

Country Link
CN (1) CN101784022A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101094135A (en) * 2006-06-23 2007-12-26 腾讯科技(深圳)有限公司 Method and system for extracting information of content in Internet
CN101184259A (en) * 2007-11-01 2008-05-21 浙江大学 Keyword automatically learning and updating method in rubbish short message
CN101257671A (en) * 2007-07-06 2008-09-03 浙江大学 Method for real time filtering large scale rubbish SMS based on content

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
赵治国,谭敏生,李志敏: "基于改进贝叶斯的垃圾邮件过滤算法综述", 《南华大学学报(自然科学版)》 *

Cited By (50)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102547623A (en) * 2010-12-08 2012-07-04 中国电信股份有限公司 Junk short message processing method and system
CN102547623B (en) * 2010-12-08 2015-05-20 中国电信股份有限公司 Junk short message processing method and system
CN102790752A (en) * 2011-05-20 2012-11-21 盛乐信息技术(上海)有限公司 Fraud information filtering system and method on basis of feature identification
CN102421074B (en) * 2011-07-26 2017-05-10 中兴通讯股份有限公司 Short message monitoring method and device
CN102368842A (en) * 2011-10-12 2012-03-07 中国联合网络通信集团有限公司 Detection method of abnormal behavior of mobile terminal and detection system thereof
CN102368842B (en) * 2011-10-12 2013-03-20 中国联合网络通信集团有限公司 Detection method of abnormal behavior of mobile terminal and detection system thereof
CN103049454A (en) * 2011-10-16 2013-04-17 同济大学 Chinese and English search result visualization system based on multi-label classification
CN103577406A (en) * 2012-07-19 2014-02-12 深圳中兴网信科技有限公司 Method and device for managing unstructured data
CN103577406B (en) * 2012-07-19 2019-04-16 深圳中兴网信科技有限公司 A kind of method and device managing unstructured data
CN103024746A (en) * 2012-12-30 2013-04-03 清华大学 System and method for processing spam short messages for telecommunication operator
CN103024746B (en) * 2012-12-30 2015-06-17 清华大学 System and method for processing spam short messages for telecommunication operator
CN103067896A (en) * 2013-01-17 2013-04-24 中国联合网络通信集团有限公司 Junk short message filtering method and device
CN103067896B (en) * 2013-01-17 2015-08-19 中国联合网络通信集团有限公司 Method for filtering spam short messages and device
CN103336766A (en) * 2013-07-04 2013-10-02 微梦创科网络科技(中国)有限公司 Short text garbage identification and modeling method and device
CN103336766B (en) * 2013-07-04 2016-12-28 微梦创科网络科技(中国)有限公司 Short text garbage identification and modeling method and device
CN103425777B (en) * 2013-08-15 2016-12-28 北京大学 Intelligent short message classification and searching method based on improved Bayesian classification
CN103425777A (en) * 2013-08-15 2013-12-04 北京大学 Intelligent short message classification and searching method based on improved Bayesian classification
WO2015032123A1 (en) * 2013-09-04 2015-03-12 盈世信息科技(北京)有限公司 Method and device for extracting number from e-mail
CN103455754A (en) * 2013-09-05 2013-12-18 上海交通大学 Regular expression-based malicious search keyword recognition method
CN103455754B (en) * 2013-09-05 2016-05-04 上海交通大学 Regular expression-based malicious search keyword recognition method
CN104469709A (en) * 2013-09-13 2015-03-25 联想(北京)有限公司 Method for recognizing short message and electronic equipment
CN103888921A (en) * 2013-09-21 2014-06-25 天津思博科科技发展有限公司 Short message intelligent deleting module
CN104714938B (en) * 2013-12-12 2017-12-29 联想(北京)有限公司 Information processing method and electronic device
CN104714938A (en) * 2013-12-12 2015-06-17 联想(北京)有限公司 Message processing method and electronic device
CN103702301A (en) * 2013-12-31 2014-04-02 大连环宇移动科技有限公司 Real-time sensing control system for inter-internet short message service
CN104010284A (en) * 2014-05-30 2014-08-27 可牛网络技术(北京)有限公司 Method and device for processing spam short message
CN105205079A (en) * 2014-06-26 2015-12-30 联想(北京)有限公司 Information processing method and electronic equipment
CN105282720A (en) * 2014-07-23 2016-01-27 中国移动通信集团重庆有限公司 Junk short message filtering method and device
CN105282720B (en) * 2014-07-23 2018-12-04 中国移动通信集团重庆有限公司 Junk short message filtering method and device
CN104168548A (en) * 2014-08-21 2014-11-26 北京奇虎科技有限公司 Short message intercepting method and device and cloud server
CN105138611A (en) * 2015-08-07 2015-12-09 北京奇虎科技有限公司 Short message type identification method and device
WO2016177148A1 (en) * 2015-08-18 2016-11-10 中兴通讯股份有限公司 Short message interception method and device
CN106202330A (en) * 2016-07-01 2016-12-07 北京小米移动软件有限公司 Method and device for determining junk information
CN108093376A (en) * 2016-11-21 2018-05-29 中国移动通信有限公司研究院 Method and device for filtering spam short messages
CN106934008B (en) * 2017-02-15 2020-07-21 北京时间股份有限公司 Junk information identification method and device
CN106934008A (en) * 2017-02-15 2017-07-07 北京时间股份有限公司 Junk information identification method and device
CN107168951A (en) * 2017-05-10 2017-09-15 山东大学 Rule- and dictionary-based automatic auditing method for prison inmates' short messages
CN107548027A (en) * 2017-07-28 2018-01-05 中国移动通信集团江苏有限公司 Data push method, device, equipment and computer-readable storage medium
CN109922444B (en) * 2017-12-13 2020-11-03 中国移动通信集团公司 Spam message identification method and device
CN109922444A (en) * 2017-12-13 2019-06-21 中国移动通信集团公司 Spam message identification method and device
CN110968687A (en) * 2018-09-30 2020-04-07 北京国双科技有限公司 Method and device for classifying texts
CN110968687B (en) * 2018-09-30 2023-06-16 北京国双科技有限公司 Method and device for classifying text
CN109446527A (en) * 2018-10-26 2019-03-08 广东小天才科技有限公司 Nonsensical corpus analysis method and system
CN109446527B (en) * 2018-10-26 2023-10-20 广东小天才科技有限公司 Nonsensical corpus analysis method and system
CN109471920A (en) * 2018-11-19 2019-03-15 北京锐安科技有限公司 Text marking method and apparatus, electronic device, and storage medium
CN111310452A (en) * 2018-12-12 2020-06-19 北京京东尚科信息技术有限公司 Word segmentation method and device
CN111339250A (en) * 2020-02-20 2020-06-26 北京百度网讯科技有限公司 Mining method of new category label, electronic equipment and computer readable medium
CN111339250B (en) * 2020-02-20 2023-08-18 北京百度网讯科技有限公司 Mining method for new category labels, electronic equipment and computer readable medium
US11755654B2 (en) 2020-02-20 2023-09-12 Beijing Baidu Netcom Science Technology Co., Ltd. Category tag mining method, electronic device and non-transitory computer-readable storage medium
CN112714447A (en) * 2020-12-22 2021-04-27 南京翼启莱信息技术有限公司 Platform short message purification method based on mobile phone number and short message content dual-mode detection

Similar Documents

Publication Publication Date Title
CN101784022A (en) Method and system for filtering and classifying short messages
CN103024746B (en) System and method for processing spam short messages for telecommunication operator
CN106550155B (en) Method and system for screening, classifying, and intercepting fraud samples of suspicious numbers
CN101257671B (en) Content-based method for real-time filtering of large-scale spam SMS
CN101184259B (en) Method for automatically learning and updating keywords in spam short messages
CN102591854B (en) Advertisement filtering system for text features and filtering method thereof
CN103037339B (en) Short message filtering method based on "user credit worthiness and short message spam degree"
CN102096703B (en) Filtering method and equipment of short messages
CN107908715A (en) Microblog sentiment polarity discrimination method based on AdaBoost and weighted classifier fusion
CN101996241A (en) Bayesian algorithm-based content filtering method
CN107609103A (en) Twitter-based event detection method
CN103634473A (en) Naive Bayesian classification based mobile phone spam short message filtering method and system
CN102088697A (en) Method and system for processing spam
CN109947934B (en) Data mining method and system for short text
CN105589845A (en) Junk text recognizing method, device and system
CN103108290A (en) Short message handling method and device
CN106649338B (en) Information filtering strategy generation method and device
CN101329668A (en) Method and apparatus for generating information regulation and method and system for judging information types
KR20060087735A (en) System and method for proceeding improved spam message filtering
CN110059189B (en) Game platform message classification system and method
CN112492606A (en) Classification and identification method and device for spam messages, computer equipment and storage medium
CN110232159A (en) Intelligent public opinion analysis method based on big data
Li et al. A Vector Space Model based spam SMS filter
CN112380323A (en) Junk information removing system and method based on Chinese word segmentation recognition technology
CN105404670B (en) Method and device for identifying harassing short messages

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 2010-07-21