CN101784022A - Method and system for filtering and classifying short messages - Google Patents
- Publication number
- CN101784022A (application CN200910077123A)
- Authority
- CN
- China
- Prior art keywords
- short message
- filtering
- regular expression
- messages
- spam messages
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention provides a spam short message filtering method that builds on traditional short message filtering, is based on sending-volume features and message content features, and combines a Chinese character regular expression with an improved Bayesian algorithm. It improves the accuracy of spam recognition while reducing the false-positive and false-negative rates, and it classifies spam a second time so that users can personalize their settings. The method comprises the following steps: (1) preprocessing the short message text; (2) sending-volume matching: matching the content sent and the quantity sent; (3) performing lexical word segmentation using the Chinese character regular expression and a dictionary-plus-part-of-speech method; (4) classifying with a spam short message classifier: computing probabilities with the improved Bayesian algorithm and identifying spam and non-spam messages using the short message feature rules defined by the Chinese character regular expression; and (5) using a short message type-affiliation classifier to sort the identified spam into types.
Description
Technical field:
The present invention relates to the interception of spam short messages, and in particular to a method and system by which a telecom operator's short message center filters short messages and classifies them a second time.
Background:
SMS has become a very important form of communication in China. Yet while enjoying this convenience at our fingertips, we also have to face harassment from spam short messages at any time. Spam not only harasses us; more seriously, it has become a tool for some criminals to distribute and spread criminal information.
Commonly used short message filtering methods and mechanisms currently include keyword-based filtering, content-based filtering, and filtering based on analysis of sending volume and sender. Most of them carry over general junk-information techniques, such as Bayes, SVM and artificial neural network algorithms, and each has drawbacks when used alone. For example, keyword filtering has high false-positive and false-negative rates: for a message such as "Company so-and-so provides so-and-so service on a long-term basis", filtering on single keywords such as "company", "long-term", "provide" and "service" leads to a high false-recognition rate or a high miss rate. A sending-frequency mechanism keyed to a single calling number can be evaded by sending in batches from multiple numbers. Moreover, commonly used filters treat all spam alike and cannot be customized per user: if a user wants to receive "real estate" messages, those messages should not be treated as spam for that user. How to combine multiple filtering algorithms and mechanisms so as to keep false-positive and false-negative rates low, let users subscribe to the information they want, and genuinely prevent the flood of spam is a problem that urgently needs solving.
Summary of the invention:
To overcome the above deficiencies, the present invention builds on traditional spam short message filtering and proposes a filtering method based on sending-volume features and spam content features, combined with a "Chinese character regular expression" and an "improved Bayesian algorithm". It improves spam recognition accuracy while reducing the false-positive and false-negative rates. The invention also classifies spam a second time, so that users can personalize their settings and selectively block junk information.
To achieve this, the spam short message filtering method of the invention comprises the following steps:
Step 1: preprocess the short message text (keyword processing, blacklist and whitelist processing).
Step 2: sending-volume matching: match the content sent and the quantity sent.
Step 3: perform lexical word segmentation using the "Chinese character regular expression" and the "dictionary plus part of speech" method.
Step 4: classify with the spam short message classifier: apply the short message feature rules defined by the "Chinese character regular expression", compute probabilities with the improved Bayesian algorithm, and identify spam and non-spam messages.
Step 5: classify with the short message type-affiliation classifier, sorting the identified spam into types.
Sending-volume matching in step 2 means comparing and matching the target message against the content of messages sent within a certain period, and computing a corresponding weight for use as a parameter in later calculation.
The short message feature rules defined by the "Chinese character regular expression" in step 4 are rules that judge whether a message is spam from the text length, telephone numbers, addresses, web addresses (organizations), the digit-and-symbol ratio, and the relationships between phrase probabilities.
The improved Bayesian algorithm in step 4 builds on the traditional Bayesian algorithm by further fusing the relevance of each feature attribute into the algorithm as a weight.
The short message type-affiliation classifier in step 5 performs a second classification of messages already judged to be spam.
By combining the above algorithms and mechanisms, the invention brings together the strengths of each method. While filtering spam effectively, it adopts category-based customization so that messages a user wants are exempt from filtering, giving a more user-friendly, systematic method for spam short message filtering.
Description of drawings:
Fig. 1 is a workflow diagram of the short message filtering and classification system provided by the invention.
Fig. 2 is a principle flow chart of the short message filtering and classification method provided by the invention.
Fig. 3 is a flow chart of the short message filtering and secondary classification provided by the invention.
Embodiment:
The method for short message filtering and secondary classification provided by the invention comprises the following steps:
Step 1: preprocess the short message text (keyword processing, blacklist and whitelist processing).
Before word segmentation, the message content must first be preprocessed, including deletion, normalization and marking. Preprocessing acts as a semantic segmentation step and improves segmentation accuracy; it also marks key features of spam content, laying the groundwork for later analysis.
First, invalid parts of the message content are deleted or marked, which reduces interference and improves the efficiency of later processing.
The message content is then normalized: for example, full-width digits and symbols are converted to standard half-width ones, and special variants are recognized, such as "O" standing for "0" and "I" standing for "1".
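The normalization described here can be sketched roughly as follows. This is an illustrative sketch rather than the patent's implementation: it assumes Unicode NFKC folding is an acceptable stand-in for the full-width to half-width conversion, and the function names and the rule "only substitute when the result is a pure digit string" are hypothetical.

```python
import unicodedata

# Look-alike letters named in the text: "O" for "0" and "I" for "1".
LOOKALIKES = str.maketrans({"O": "0", "I": "1"})


def normalize_text(text: str) -> str:
    """Fold full-width digits, letters and punctuation to half-width (NFKC)."""
    return unicodedata.normalize("NFKC", text)


def normalize_digit_run(token: str) -> str:
    """Replace look-alike letters with digits when the token is a disguised number.

    Assumption: the substitution is accepted only if the token already holds
    at least one real digit and becomes a pure digit string afterwards.
    """
    folded = normalize_text(token)
    candidate = folded.translate(LOOKALIKES)
    if candidate.isdigit() and any(ch.isdigit() for ch in folded):
        return candidate
    return folded
```

Guarding the substitution keeps ordinary words that merely contain "O" or "I" untouched.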
Important markers are extracted and tagged as key spam content features, such as telephone numbers, addresses, organization names, personal names, web addresses and e-mail addresses.
The preprocessing algorithm uses the "Chinese character regular expression", which handles punctuation, English text and digits more flexibly and makes it easy to add new rules as message content changes.
The system applies blacklist/whitelist and keyword filtering at two levels: the system level uses blacklists, whitelists and keywords provided uniformly by the system, and at the user level each user can configure them as needed.
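A minimal sketch of the two-level list check follows. The text does not say which level takes precedence, so letting the user level override the system level is an assumption, as are the class and function names.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FilterLevel:
    """One level (system or user) of whitelist, blacklist and keyword settings."""
    whitelist: set = field(default_factory=set)
    blacklist: set = field(default_factory=set)
    keywords: set = field(default_factory=set)

    def decide(self, sender: str, text: str) -> Optional[str]:
        if sender in self.whitelist:
            return "allow"
        if sender in self.blacklist:
            return "block"
        if any(keyword in text for keyword in self.keywords):
            return "block"
        return None  # no opinion at this level


def prefilter(sender: str, text: str, system: FilterLevel, user: FilterLevel) -> Optional[str]:
    """Return "allow", "block", or None when the message should go on to later steps."""
    for level in (user, system):  # assumed order: user settings win
        verdict = level.decide(sender, text)
        if verdict is not None:
            return verdict
    return None
```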
Step 2: sending-volume matching: match the content sent and the quantity sent.
Sending volume is a key feature for judging whether a message is spam. The number of messages sent by the same sending number per unit time, and the number of messages with the same content over the same period, are both important bases for the judgment.
The sending-volume monitoring module connects to the mobile operator's short message center to obtain the sending volume of all mobile numbers in real time, and records the number of messages with identical content sent to different users within a time period.
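The two counters described in this step could be kept with a sliding time window, as in the sketch below. The window length, the use of a SHA-1 digest to group identical content, and the class name are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict, deque


class VolumeMonitor:
    """Counts messages per sender and distinct recipients per identical content."""

    def __init__(self, window_seconds: float = 3600.0):
        self.window = window_seconds
        self.by_sender = defaultdict(deque)   # sender -> timestamps
        self.by_content = defaultdict(deque)  # content digest -> (timestamp, recipient)

    def record(self, sender: str, recipient: str, text: str, now: float):
        """Record one message; return (sender_count, distinct_recipients_of_same_text)."""
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        sent = self.by_sender[sender]
        same = self.by_content[digest]
        sent.append(now)
        same.append((now, recipient))
        # Drop entries that have fallen out of the time window.
        while sent and now - sent[0] > self.window:
            sent.popleft()
        while same and now - same[0][0] > self.window:
            same.popleft()
        return len(sent), len({r for _, r in same})
```

Both counts can then be turned into the weight that later steps use as a parameter.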
Step 3: perform lexical word segmentation using the "Chinese character regular expression" and the "dictionary plus part of speech" method.
Word segmentation keeps the dictionary in memory under a hash index, which effectively improves segmentation efficiency. A two-level index dictionary is built, consisting of a hash index on the first character, a hash index on the second character, and groups of the remaining character strings.
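One way to picture the two-level index is a nested hash map keyed by first and second character, with the remaining suffixes stored in a set. The sketch below is a hypothetical layout consistent with that description, not the patent's data structure; single-character words are kept in a separate set as an added assumption.

```python
from collections import defaultdict


class TwoLevelDictionary:
    """Dictionary indexed by first character, then second character, then suffix."""

    def __init__(self, words=()):
        self.single = set()                                   # one-character words
        self.index = defaultdict(lambda: defaultdict(set))    # first -> second -> suffixes
        self.max_len = 0
        for word in words:
            self.add(word)

    def add(self, word: str) -> None:
        if not word:
            return
        self.max_len = max(self.max_len, len(word))
        if len(word) == 1:
            self.single.add(word)
        else:
            self.index[word[0]][word[1]].add(word[2:])

    def __contains__(self, word: str) -> bool:
        if len(word) == 1:
            return word in self.single
        if len(word) < 2:
            return False
        second_level = self.index.get(word[0])
        if second_level is None:
            return False
        suffixes = second_level.get(word[1])
        return suffixes is not None and word[2:] in suffixes
```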
The segmentation algorithm is reverse maximum matching: it segments from the end of the sentence backwards, combining maximum phrase-length matching with structural analysis of keyword sentences. A piece of text no longer than the maximum phrase length is cut off and matched; if it is a word, it is extracted, and the remaining text is segmented in the same way.
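Reverse maximum matching can be sketched as follows. The vocabulary here is a plain Python set and the fallback to single characters for unknown text is an assumption; the structural analysis mentioned in the text is not modelled.

```python
def reverse_max_match(text, vocabulary, max_len=4):
    """Segment text from the end backwards, preferring the longest dictionary word."""
    tokens = []
    end = len(text)
    while end > 0:
        start = max(0, end - max_len)
        # Shrink the window from the left until it is a known word
        # or only a single character remains.
        while start < end - 1 and text[start:end] not in vocabulary:
            start += 1
        tokens.append(text[start:end])
        end = start
    tokens.reverse()
    return tokens
```

For the sentence 研究生命的起源, scanning from the end keeps 生命 together, whereas a forward longest match would first take 研究生.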
After segmentation the text becomes a string of phrases. Many phrases have more than one part of speech and possibly more than one meaning, so their part of speech and meaning can only be determined from the specific context. Grammar-based segmentation handles this well: the part-of-speech order of each sentence is analyzed, and common sentence patterns are obtained through model training. A trained Markov chain can perform part-of-speech tagging effectively. It is a discrete-time stochastic process with the Markov property: given present knowledge, the past (the history before the current state) is irrelevant to predicting the future (the state after the current one).
Advantages of using a Markov chain:
A. Part-of-speech tagging fits an automatic tagging model, for which an approximate approach is acceptable.
B. It provides a strong theoretical framework and a direct, effective means of resolving ambiguity.
C. The required model parameters can be estimated from known data, that is, obtained by training.
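Part-of-speech tagging with a first-order Markov model is commonly decoded with the Viterbi algorithm. The sketch below assumes that approach, which the text does not name; the tag set and probability tables in the usage note are invented toy values.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for words under a first-order Markov model."""
    if not words:
        return []

    def lg(p):
        return math.log(p if p > 0 else floor)  # floor avoids log(0)

    # best[i][t] = (log-probability of the best path ending in tag t, previous tag)
    best = [{t: (lg(start_p.get(t, 0)) + lg(emit_p[t].get(words[0], 0)), None) for t in tags}]
    for word in words[1:]:
        column = {}
        for t in tags:
            score, prev = max((best[-1][p][0] + lg(trans_p[p].get(t, 0)), p) for p in tags)
            column[t] = (score + lg(emit_p[t].get(word, 0)), prev)
        best.append(column)

    # Backtrack from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    path.reverse()
    return path
```

With toy tables in which 服务 can be either a noun or a verb, the transition probabilities decide the tag from context, which is the disambiguation role the text assigns to the Markov chain.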
In a short message filtering system, segmentation performance strongly affects classification performance, so testing uses two measures of segmentation performance:
Sentence level: performance = number of correct sentences / total number of sentences × 100%
Word level: performance = number of correct words / total number of words × 100%
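The two measures can be computed as in the sketch below. The text does not define when a word counts as correct, so this sketch assumes a predicted word is correct when its character span matches a word in the reference segmentation, and it takes the reference word count as the total.

```python
def spans(tokens):
    """Character spans (start, end) covered by each token of one sentence."""
    result, pos = set(), 0
    for token in tokens:
        result.add((pos, pos + len(token)))
        pos += len(token)
    return result


def segmentation_scores(predicted, reference):
    """Return (sentence-level %, word-level %) for parallel lists of token lists.

    Assumes reference is non-empty and contains at least one word.
    """
    correct_sentences = sum(p == r for p, r in zip(predicted, reference))
    correct_words = sum(len(spans(p) & spans(r)) for p, r in zip(predicted, reference))
    total_words = sum(len(r) for r in reference)
    return (100.0 * correct_sentences / len(reference),
            100.0 * correct_words / total_words)
```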
Step 4: classify with the spam short message classifier: apply the short message feature rules defined by the "Chinese character regular expression", compute probabilities with the improved Bayesian algorithm, and identify spam and non-spam messages.
The spam classifier uses an inductive algorithm that extracts spam features and classifies using probability and statistics. Short messages differ from spam e-mail in length and content. After long-term research, data mining, feature extraction and testing, the main spam features were found to be: length, telephone numbers, addresses, web addresses, organization names, sending volume, word probabilities, and so on.
Length feature:
The relative length of spam messages follows a normal distribution. Statistical methods are used to count the distribution of the number of Chinese characters in normal messages and in spam messages.
Statistical rules:
A. Messages without Chinese characters are not counted.
B. Statistics are computed on messages after similar samples have been removed.
Test results:
A. Spam messages are generally longer;
B. Normal messages are generally shorter;
C. Among especially long messages, normal messages account for a slightly larger share.
The telephone number feature:
A large share of spam is of the mass-distribution type, and most of it leaves contact details such as a telephone number. Telephone numbers are written in several ways: all half-width (English) digits, all full-width (Chinese) digits, or a mixture, and their length falls within a certain range.
To evade word filters, spam senders often use the mixed form, and often substitute the character "O" for the digit "0" or "I" for "1". Chinese character regular expressions can effectively recognize telephone numbers written in these combined forms.
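The patent does not list its telephone number pattern, so the following is only a sketch of how mixed-form numbers might be recognized: fold full-width digits with NFKC, let the candidate pattern accept the look-alike letters "O" and "I", then map them to digits and check the length. The pattern, the accepted separators and the 7 to 13 digit range are assumptions.

```python
import re
import unicodedata

LOOKALIKES = str.maketrans({"O": "0", "I": "1"})
# Candidate run: digits or look-alike letters, optionally separated by "-" or spaces,
# and not starting in the middle of an ASCII word.
CANDIDATE = re.compile(r"(?<![A-Za-z])[0-9OI][0-9OI\- ]{5,18}[0-9OI]")


def find_phone_numbers(text, min_len=7, max_len=13):
    """Return digit strings that look like telephone numbers in text."""
    folded = unicodedata.normalize("NFKC", text)  # full-width digits -> half-width
    found = []
    for match in CANDIDATE.finditer(folded):
        raw = match.group(0)
        if not any(ch.isdigit() for ch in raw):
            continue  # pure letters such as "OIOIOIO" are ignored
        digits = re.sub(r"[\- ]", "", raw).translate(LOOKALIKES)
        if digits.isdigit() and min_len <= len(digits) <= max_len:
            found.append(digits)
    return found
```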
Address, web address and organization features:
Besides telephone numbers, spam often contains other contact details, such as addresses, places, website names, web addresses and organization names. Chinese character regular expression patterns can match such content successfully.
Common form of a web address: xxx.xxx.xxx, where xxx stands for English letters or digits.
IP form: nnn.nnn.nnn.nnn, where nnn is a number from 0 to 255.
The Chinese character regular expressions are as follows:
{(http|HTTP)://[a-zA-Z0-9/_-]+(\.[a-zA-Z0-9/_-]+)+}
{(WWW|www)\.[a-zA-Z0-9_-]+(\.[a-zA-Z0-9/_-]+)+}
...
Common form of an e-mail address: xxx@xxx.xxx, where xxx stands for English letters, digits or certain symbols.
The Chinese character regular expressions are as follows:
{[a-zA-Z0-9\._-]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+}
...
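The web address and e-mail patterns above carry over to Python's `re` module almost unchanged once the surrounding braces are dropped. The sketch below does that; the IPv4 pattern is not given in the text and is an added assumption based on the stated 0 to 255 range, as is the function name.

```python
import re

URL_HTTP = re.compile(r"(http|HTTP)://[a-zA-Z0-9/_-]+(\.[a-zA-Z0-9/_-]+)+")
URL_WWW = re.compile(r"(WWW|www)\.[a-zA-Z0-9_-]+(\.[a-zA-Z0-9/_-]+)+")
EMAIL = re.compile(r"[a-zA-Z0-9\._-]+@[a-zA-Z0-9_-]+(\.[a-zA-Z0-9_-]+)+")
# Assumed pattern for the "IP form": four numbers in the range 0-255.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1?[0-9]?[0-9])"
IPV4 = re.compile(r"(?<![0-9.])(?:" + OCTET + r"\.){3}" + OCTET + r"(?![0-9])")


def extract_contacts(text):
    """Return every match of each pattern, keyed by feature name."""
    patterns = {"http": URL_HTTP, "www": URL_WWW, "email": EMAIL, "ip": IPV4}
    return {name: [m.group(0) for m in pattern.finditer(text)]
            for name, pattern in patterns.items()}
```

Note that the `www` pattern also fires inside an `http://www...` address, so a caller may want to deduplicate overlapping matches.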
Addresses take many forms, such as XXX Street or No. NNN, XXX Road.
The Chinese character regular expressions are as follows:
[~]+(road | the main road | the street) [0-9]+number
...
Organization names take even more forms, such as XXX Market, XXX Hotel, XXX Company, and so on.
The Chinese character regular expressions are as follows:
[~] [~]+(company | group | business department | the shop | the dining room | battalion ...)
...
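The address and organization patterns above appear here only in translated form, so the sketch below guesses at workable Chinese equivalents. Everything in it is an assumption: the Han character class standing in for the "[~]" placeholder, the suffix words 路, 大道, 街, 号, 公司, 集团, 营业部, 店 and 餐厅, and the function names.

```python
import re

HAN = r"[\u4e00-\u9fff]"  # assumed stand-in for the "[~]" placeholder

# One or more Chinese characters, a road-type suffix, a house number, then 号.
ADDRESS = re.compile(HAN + r"+(?:路|大道|街)[0-9]+号")
# At least two Chinese characters followed by an organization-type suffix.
ORGANIZATION = re.compile(HAN + r"{2,}(?:公司|集团|营业部|店|餐厅)")


def find_addresses(text):
    return [m.group(0) for m in ADDRESS.finditer(text)]


def find_organizations(text):
    return [m.group(0) for m in ORGANIZATION.finditer(text)]
```

Because the leading character class is greedy, a match can absorb words that precede the name; segmenting the text first would narrow the match.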
Digit-and-symbol ratio:
The digit-and-symbol ratio is one of the important bases for identifying spam. Order-type messages, for example, are normal messages in which digits and symbols make up a large share of the content.
Statistical rules:
A. Messages without Chinese characters are not counted (purely numeric messages are generally normal messages).
B. Statistics are computed on messages after similar samples have been removed.
C. Each continuous string of digits and letters counts as one unit, and each Chinese character counts as one unit.
Main conclusions:
A. In spam, the share of digit-letter strings is generally small.
B. Messages with a large share of digit-letter strings are mostly normal messages.
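The counting rules above translate into a short function. In this sketch the ratio is taken as digit-letter units over all units, which is an assumed reading of the "share" the text refers to; punctuation is ignored, and the function name is hypothetical.

```python
import re

# One unit per run of ASCII letters/digits, one unit per Chinese character.
UNIT = re.compile(r"[A-Za-z0-9]+|[\u4e00-\u9fff]")


def alnum_ratio(text):
    """Share of digit-letter units among all units, or None if the text has no Chinese."""
    units = UNIT.findall(text)
    han = sum(1 for unit in units if "\u4e00" <= unit[0] <= "\u9fff")
    if han == 0:
        return None  # rule A: messages without Chinese characters are not counted
    return (len(units) - han) / len(units)
```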
The spam short message classifier:
Short message text is divided into two classes (spam / non-spam). Phrases and other features serve as the distinguishing attributes, and the probability of each feature occurring in spam and in non-spam is counted. After comparing the spam-filtering performance of several text classifiers, such as Bayes, K-nearest neighbor and support vector machine algorithms, the improved Bayesian algorithm was chosen as the primary method. Each feature is treated as an attribute with a probability vector over spam / non-spam. When a message arrives, the probabilities of it belonging to spam and to non-spam are computed from its segmentation and features.
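The summary describes the improved algorithm as fusing the relevance of each feature attribute into the Bayesian calculation as a weight. One plausible reading, sketched below, multiplies each feature's log-likelihood by its weight; this weighted form, the default weight of 1.0 and the probability floor are assumptions, not the patent's formula.

```python
import math


def classify(features, priors, likelihoods, weights=None, floor=1e-4):
    """Return posterior probabilities per class from a weighted naive Bayes score.

    score(c) = log P(c) + sum over features f of  w_f * log P(f | c)
    """
    weights = weights or {}
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for feature in features:
            p = likelihoods[label].get(feature, 0.0) or floor  # avoid log(0)
            score += weights.get(feature, 1.0) * math.log(p)
        scores[label] = score
    # Normalise in log space so the posteriors sum to 1.
    top = max(scores.values())
    total = sum(math.exp(s - top) for s in scores.values())
    return {label: math.exp(s - top) / total for label, s in scores.items()}
```

Features here can be phrases from segmentation or rule outputs such as a hypothetical `has_phone` flag, so that regular-expression features and word probabilities feed the same calculation.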
When training on message text, stop words must be handled. Stop words are phrases that occur very frequently but carry no meaning, such as interjections and auxiliary words. The benefits are:
A. It reduces interference from such phrases and improves the accuracy of the classifier.
B. Stop words tend to occur very frequently, so handling them both shrinks the training text and reduces the number of word look-ups during recognition, greatly improving recognition efficiency and performance.
Correction of classifier probabilities: if a phrase occurs in a few spam texts and never in non-spam texts, its spam probability is very high and its non-spam probability is zero. The classifier would then label any text containing that phrase as spam, which raises the false-positive rate. The training result must therefore be corrected: when a phrase's probability of occurrence in some class is zero, it is replaced with the minimum probability value in the phrase probability library.
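The zero-probability correction can be sketched as follows. Probabilities are estimated here as document frequencies, which is an assumption, and the "minimum probability value in the phrase probability library" is read as the smallest non-zero value in the trained table.

```python
from collections import Counter


def phrase_probabilities(spam_docs, ham_docs):
    """Return {phrase: (p_spam, p_ham)} with zero entries raised to the table minimum.

    Each document is a list of tokens; both collections are assumed non-empty.
    """
    spam_counts = Counter(tok for doc in spam_docs for tok in set(doc))
    ham_counts = Counter(tok for doc in ham_docs for tok in set(doc))
    raw = {
        phrase: (spam_counts[phrase] / len(spam_docs), ham_counts[phrase] / len(ham_docs))
        for phrase in set(spam_counts) | set(ham_counts)
    }
    # Smallest non-zero probability anywhere in the table.
    floor = min(p for pair in raw.values() for p in pair if p > 0)
    return {phrase: (ps or floor, ph or floor) for phrase, (ps, ph) in raw.items()}
```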
Computing formula:
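Suppose $d_i = \{w_{i1}, w_{i2}, \ldots, w_{in}\}$ is an arbitrary document that belongs to some class $c_j$ in the set of document classes $C = \{c_1, c_2, \ldots, c_k\}$. According to the Bayes classifier:
[formula not preserved in the source text]
Where:
[formula not preserved in the source text]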
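The formulas that belong at this point are not present in this text. For orientation only, the textbook Bayes classifier for a document $d_i$ and class $c_j$ takes the form below; the product on the right holds only under the attribute-independence assumption, and nothing here should be read as the patent's own improved formula.

```latex
P(c_j \mid d_i) = \frac{P(c_j)\, P(d_i \mid c_j)}{P(d_i)},
\qquad
P(d_i \mid c_j) = \prod_{t=1}^{n} P(w_{it} \mid c_j)
```

Since $P(d_i)$ is the same for every class, the predicted class is the $c_j$ that maximizes $P(c_j)\,P(d_i \mid c_j)$.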
On the correlation between Bayes attributes: text classification often simplifies the algorithm by assuming that all attributes are mutually independent. In evaluating spam classification, however, the independence assumption greatly reduced classification performance. Taking into account algorithm complexity and current hardware speed, an algorithm based on pairwise-correlated attributes is adopted, which greatly improves analysis performance.
Step 5: classify with the short message type-affiliation classifier, sorting the identified spam into types.
Different people define spam and non-spam differently, and spam comes in many types, for example real-estate promotion, invoices and education. Each person may want to receive a certain category of messages, so a type-affiliation classifier must be added after the spam classifier.
The type-affiliation classifier performs probability calculation on phrases only. It works much like the spam classifier but must handle classification into multiple types.
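The second-stage decision can be sketched as a multi-class version of the phrase-probability calculation followed by a check against the user's subscriptions. The category names, the plain (unweighted) naive Bayes score and the function names are assumptions for illustration.

```python
import math


def spam_type(tokens, priors, likelihoods, floor=1e-4):
    """Pick the most probable spam category from phrase probabilities alone."""
    def score(label):
        return math.log(priors[label]) + sum(
            math.log(likelihoods[label].get(tok, 0.0) or floor) for tok in tokens
        )
    return max(priors, key=score)


def should_deliver(is_spam, tokens, priors, likelihoods, subscribed):
    """Deliver normal messages, and spam only when the user subscribed to its type."""
    if not is_spam:
        return True
    return spam_type(tokens, priors, likelihoods) in subscribed
```

A user who has subscribed to the hypothetical "real_estate" category would still receive a property promotion that the first-stage classifier marked as spam.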
Claims (9)
1. A short message filtering and classification method and system, the method comprising:
Step 1: preprocess the short message text (keyword processing, blacklist and whitelist processing).
Step 2: sending-volume matching: match the content sent and the quantity sent.
Step 3: perform lexical word segmentation using the "Chinese character regular expression" and the "dictionary plus part of speech" method.
Step 4: classify with the spam short message classifier: apply the short message feature rules defined by the "Chinese character regular expression", compute probabilities with the improved Bayesian algorithm, and identify spam and non-spam messages.
Step 5: classify with the short message type-affiliation classifier, sorting the identified spam into types.
On the basis of traditional spam short message filtering, the invention proposes feature-based filtering, which improves the accuracy of spam recognition while reducing the false-positive and false-negative rates.
2. The short message filtering and classification method according to claim 1, characterized in that step 1 further comprises: the preprocessing algorithm uses the Chinese character regular expression algorithm, which handles punctuation, English text, digits and the like more flexibly.
3. The short message filtering and classification method according to claim 1, characterized in that step 2 further comprises: matching on identical message content, similar message content, and the number of messages per unit time.
4. The short message filtering and classification method according to claim 1, characterized in that step 3 further comprises: reverse matching against the dictionary, and using a Markov chain to correct parts of speech.
5. The short message filtering and classification method according to claim 1, characterized in that step 4 further comprises: after the message text is segmented, extracting the attributes of the feature vector.
6. The short message filtering and classification method according to claims 1 and 5, characterized in that step 4 further comprises: a telephone number recognition method and its regular expressions.
7. The short message filtering and classification method according to claims 1, 5 and 6, characterized in that step 4 further comprises: recognition methods for address and web address (organization) features, and the content of their regular expressions.
8. The short message filtering and classification method according to claims 1, 5, 6 and 7, characterized in that step 4 further comprises: the correction algorithm of the improved Bayesian algorithm.
9. The short message filtering and classification method according to claim 1, characterized in that step 5 further comprises: secondary classification after spam filtering, to determine type affiliation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200910077123A CN101784022A (en) | 2009-01-16 | 2009-01-16 | Method and system for filtering and classifying short messages |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101784022A true CN101784022A (en) | 2010-07-21 |
Family
ID=42523793
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN200910077123A Pending CN101784022A (en) | 2009-01-16 | 2009-01-16 | Method and system for filtering and classifying short messages |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101784022A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101094135A (en) * | 2006-06-23 | 2007-12-26 | 腾讯科技(深圳)有限公司 | Method and system for extracting information of content in Internet |
CN101257671A (en) * | 2007-07-06 | 2008-09-03 | 浙江大学 | Method for real time filtering large scale rubbish SMS based on content |
CN101184259A (en) * | 2007-11-01 | 2008-05-21 | 浙江大学 | Keyword automatically learning and updating method in rubbish short message |
Non-Patent Citations (1)
Title |
---|
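Zhao Zhiguo, Tan Minsheng, Li Zhimin: "A survey of spam e-mail filtering algorithms based on improved Bayes" (基于改进贝叶斯的垃圾邮件过滤算法综述), Journal of University of South China (Science and Technology) (《南华大学学报(自然科学版)》) * |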
Cited By (50)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102547623A (en) * | 2010-12-08 | 2012-07-04 | 中国电信股份有限公司 | Junk short message processing method and system |
CN102547623B (en) * | 2010-12-08 | 2015-05-20 | 中国电信股份有限公司 | Junk short message processing method and system |
CN102790752A (en) * | 2011-05-20 | 2012-11-21 | 盛乐信息技术(上海)有限公司 | Fraud information filtering system and method on basis of feature identification |
CN102421074B (en) * | 2011-07-26 | 2017-05-10 | 中兴通讯股份有限公司 | Short message monitoring method and device |
CN102368842A (en) * | 2011-10-12 | 2012-03-07 | 中国联合网络通信集团有限公司 | Detection method of abnormal behavior of mobile terminal and detection system thereof |
CN102368842B (en) * | 2011-10-12 | 2013-03-20 | 中国联合网络通信集团有限公司 | Detection method of abnormal behavior of mobile terminal and detection system thereof |
CN103049454A (en) * | 2011-10-16 | 2013-04-17 | 同济大学 | Chinese and English search result visualization system based on multi-label classification |
CN103577406A (en) * | 2012-07-19 | 2014-02-12 | 深圳中兴网信科技有限公司 | Method and device for managing unstructured data |
CN103577406B (en) * | 2012-07-19 | 2019-04-16 | 深圳中兴网信科技有限公司 | A kind of method and device managing unstructured data |
CN103024746A (en) * | 2012-12-30 | 2013-04-03 | 清华大学 | System and method for processing spam short messages for telecommunication operator |
CN103024746B (en) * | 2012-12-30 | 2015-06-17 | 清华大学 | System and method for processing spam short messages for telecommunication operator |
CN103067896A (en) * | 2013-01-17 | 2013-04-24 | 中国联合网络通信集团有限公司 | Junk short message filtering method and device |
CN103067896B (en) * | 2013-01-17 | 2015-08-19 | 中国联合网络通信集团有限公司 | Method for filtering spam short messages and device |
CN103336766A (en) * | 2013-07-04 | 2013-10-02 | 微梦创科网络科技(中国)有限公司 | Short text garbage identification and modeling method and device |
CN103336766B (en) * | 2013-07-04 | 2016-12-28 | 微梦创科网络科技(中国)有限公司 | Short text garbage identification and modeling method and device |
CN103425777B (en) * | 2013-08-15 | 2016-12-28 | 北京大学 | A kind of based on the short message intelligent classification and the searching method that improve Bayes's classification |
CN103425777A (en) * | 2013-08-15 | 2013-12-04 | 北京大学 | Intelligent short message classification and searching method based on improved Bayesian classification |
WO2015032123A1 (en) * | 2013-09-04 | 2015-03-12 | 盈世信息科技(北京)有限公司 | Method and device for extracting number from e-mail |
CN103455754A (en) * | 2013-09-05 | 2013-12-18 | 上海交通大学 | Regular expression-based malicious search keyword recognition method |
CN103455754B (en) * | 2013-09-05 | 2016-05-04 | 上海交通大学 | Regular expression-based malicious search keyword recognition method |
CN104469709A (en) * | 2013-09-13 | 2015-03-25 | 联想(北京)有限公司 | Method for recognizing short message and electronic equipment |
CN103888921A (en) * | 2013-09-21 | 2014-06-25 | 天津思博科科技发展有限公司 | Short message intelligent deleting module |
CN104714938B (en) * | 2013-12-12 | 2017-12-29 | 联想(北京)有限公司 | Information processing method and electronic device |
CN104714938A (en) * | 2013-12-12 | 2015-06-17 | 联想(北京)有限公司 | Message processing method and electronic device |
CN103702301A (en) * | 2013-12-31 | 2014-04-02 | 大连环宇移动科技有限公司 | Real-time sensing control system for inter-internet short message service |
CN104010284A (en) * | 2014-05-30 | 2014-08-27 | 可牛网络技术(北京)有限公司 | Method and device for processing spam short message |
CN105205079A (en) * | 2014-06-26 | 2015-12-30 | 联想(北京)有限公司 | Information processing method and electronic equipment |
CN105282720A (en) * | 2014-07-23 | 2016-01-27 | 中国移动通信集团重庆有限公司 | Junk short message filtering method and device |
CN105282720B (en) * | 2014-07-23 | 2018-12-04 | 中国移动通信集团重庆有限公司 | Junk short message filtering method and device |
CN104168548A (en) * | 2014-08-21 | 2014-11-26 | 北京奇虎科技有限公司 | Short message intercepting method and device and cloud server |
CN105138611A (en) * | 2015-08-07 | 2015-12-09 | 北京奇虎科技有限公司 | Short message type identification method and device |
WO2016177148A1 (en) * | 2015-08-18 | 2016-11-10 | 中兴通讯股份有限公司 | Short message interception method and device |
CN106202330A (en) * | 2016-07-01 | 2016-12-07 | 北京小米移动软件有限公司 | Method and device for determining junk information |
CN108093376A (en) * | 2016-11-21 | 2018-05-29 | 中国移动通信有限公司研究院 | Method and device for filtering spam short messages |
CN106934008B (en) * | 2017-02-15 | 2020-07-21 | 北京时间股份有限公司 | Junk information identification method and device |
CN106934008A (en) * | 2017-02-15 | 2017-07-07 | 北京时间股份有限公司 | Junk information identification method and device |
CN107168951A (en) * | 2017-05-10 | 2017-09-15 | 山东大学 | Rule- and dictionary-based automatic auditing method for prison inmates' short messages |
CN107548027A (en) * | 2017-07-28 | 2018-01-05 | 中国移动通信集团江苏有限公司 | Data push method, device, equipment and computer-readable storage medium |
CN109922444B (en) * | 2017-12-13 | 2020-11-03 | 中国移动通信集团公司 | Spam message identification method and device |
CN109922444A (en) * | 2017-12-13 | 2019-06-21 | 中国移动通信集团公司 | Spam message identification method and device |
CN110968687A (en) * | 2018-09-30 | 2020-04-07 | 北京国双科技有限公司 | Method and device for classifying texts |
CN110968687B (en) * | 2018-09-30 | 2023-06-16 | 北京国双科技有限公司 | Method and device for classifying text |
CN109446527A (en) * | 2018-10-26 | 2019-03-08 | 广东小天才科技有限公司 | Analysis method and system for meaningless corpus |
CN109446527B (en) * | 2018-10-26 | 2023-10-20 | 广东小天才科技有限公司 | Nonsensical corpus analysis method and system |
CN109471920A (en) * | 2018-11-19 | 2019-03-15 | 北京锐安科技有限公司 | Text marking method and apparatus, electronic device and storage medium |
CN111310452A (en) * | 2018-12-12 | 2020-06-19 | 北京京东尚科信息技术有限公司 | Word segmentation method and device |
CN111339250A (en) * | 2020-02-20 | 2020-06-26 | 北京百度网讯科技有限公司 | Mining method of new category label, electronic equipment and computer readable medium |
CN111339250B (en) * | 2020-02-20 | 2023-08-18 | 北京百度网讯科技有限公司 | Mining method for new category labels, electronic equipment and computer readable medium |
US11755654B2 (en) | 2020-02-20 | 2023-09-12 | Beijing Baidu Netcom Science Technology Co., Ltd. | Category tag mining method, electronic device and non-transitory computer-readable storage medium |
CN112714447A (en) * | 2020-12-22 | 2021-04-27 | 南京翼启莱信息技术有限公司 | Platform short message purification method based on mobile phone number and short message content dual-mode detection |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN101784022A (en) | Method and system for filtering and classifying short messages | |
CN103024746B (en) | System and method for processing spam short messages for telecommunication operator | |
CN106550155B (en) | Method and system for screening, classifying and intercepting fraud samples for suspicious numbers | |
CN101257671B (en) | Content-based method for real-time filtering of large-scale spam SMS | |
CN101184259B (en) | Method for automatically learning and updating keywords in spam short messages | |
CN102591854B (en) | Advertisement filtering system based on text features and filtering method thereof | |
CN103037339B (en) | Short message filtering method based on "user's credit worthiness and short message spam degree" | |
CN102096703B (en) | Filtering method and equipment of short messages | |
CN107908715A (en) | Microblog sentiment polarity discrimination method based on Adaboost and classifier weighted fusion | |
CN101996241A (en) | Bayesian algorithm-based content filtering method | |
CN107609103A (en) | Twitter-based event detection method | |
CN103634473A (en) | Naive Bayesian classification based mobile phone spam short message filtering method and system | |
CN102088697A (en) | Method and system for processing spam | |
CN109947934B (en) | Data mining method and system for short text | |
CN105589845A (en) | Junk text recognizing method, device and system | |
CN103108290A (en) | Short message handling method and device | |
CN106649338B (en) | Information filtering strategy generation method and device | |
CN101329668A (en) | Method and apparatus for generating information regulation and method and system for judging information types | |
KR20060087735A (en) | System and method for proceeding improved spam message filtering | |
CN110059189B (en) | Game platform message classification system and method | |
CN112492606A (en) | Classification and identification method and device for spam messages, computer equipment and storage medium | |
CN110232159A (en) | Intelligent public opinion analysis method based on big data | |
Li et al. | A Vector Space Model based spam SMS filter | |
CN112380323A (en) | Junk information removing system and method based on Chinese word segmentation recognition technology | |
CN105404670B (en) | Method and device for identifying harassing short messages |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20100721 |