CN103037339A - Short message filtering method based on user creditworthiness and short message spam degree - Google Patents

Short message filtering method based on user creditworthiness and short message spam degree

Info

Publication number
CN103037339A
Authority
CN
China
Prior art keywords
user
short message
credit worthiness
note
degree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201210580601.5A
Other languages
Chinese (zh)
Other versions
CN103037339B (en
Inventor
杨东洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHENZHEN CITY RICHINFO TECHNOLOGY Co Ltd
Original Assignee
SHENZHEN CITY RICHINFO TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHENZHEN CITY RICHINFO TECHNOLOGY Co Ltd filed Critical SHENZHEN CITY RICHINFO TECHNOLOGY Co Ltd
Priority to CN201210580601.5A priority Critical patent/CN103037339B/en
Publication of CN103037339A publication Critical patent/CN103037339A/en
Application granted granted Critical
Publication of CN103037339B publication Critical patent/CN103037339B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a short message filtering method based on user creditworthiness and short message spam degree. The method includes the following steps: in step A, each short message service user is assigned an initial creditworthiness according to the user's activity; in step B, the text is preprocessed: normal punctuation marks are removed first, the interference characters configured in the system are identified, their count is recorded and they are removed, and specially encoded digits and pictographic codes are replaced; in step C, mobile phone numbers and uniform resource locator (URL) addresses are extracted, and message-related behavioral features are extracted; in step D, a base spam-degree attribute is added to each keyword, keywords are matched against the content preprocessed in step B, and every matched keyword is recorded; in step E, similar content is determined and the short message spam degree is calculated based on similarity; and in step F, user creditworthiness and short message spam degree are combined to decide whether to intercept the message. Because the method is based on both user creditworthiness and short message spam degree, spam short messages can be filtered more accurately and misjudgments can be reduced.

Description

A short message filtering method based on "user creditworthiness and short message spam degree"
Technical field
The invention belongs to short message processing technology in the field of Internet communication. Specifically, it relates to a short message filtering method based on "user creditworthiness and short message spam degree", by which the SMS platform of an Internet communication system supervises and filters the publicly disseminated content submitted by users.
Background technology
In recent years, with the rapid development of mailbox services, some senders of spam short messages have exploited the free SMS channels offered by certain mailbox services (such as the 139 mailbox) as a tool for illicit profit or for other concealed purposes. As one of the value-added services of mobile communication, SMS provides people with cheap and convenient communication, but it has also bred large numbers of spam messages intended to spread harmful information such as obscene or pornographic content, commercial fraud, and commercial advertising. These spam messages seriously disrupt people's lives and endanger public safety, and the problem of regulating them has drawn wide attention from all sectors of society. Besides strengthening the regulation of published information through legislation, it is even more important to explore effective spam message filtering techniques at the technical level.
In the prior art, there are two main methods for filtering spam messages: keyword-based filtering and content-based filtering.
In keyword-based filtering, the system configures a set of keywords in advance; if any of these keywords appears in a message, the message is identified as spam and intercepted. This method relies on a single criterion and produces a large number of misjudgments.
Content-based filtering uses machine learning to classify messages as normal or spam. The machine learning methods currently used for SMS classification mainly include Bayesian classifiers, SVM, KNN, and artificial neural networks. This filtering method also suffers from misjudgments.
Summary of the invention
The object of the present invention is to provide a short message filtering method based on "user creditworthiness and short message spam degree" for supervising and filtering the publicly disseminated content submitted by users.
To achieve this object, the short message filtering method based on "user creditworthiness and short message spam degree" of the present invention comprises the steps of:
A) assigning each user an initial creditworthiness according to the activity of the short message user;
B) text preprocessing: first removing normal punctuation marks from the text; identifying the interference characters configured in the system, recording their count, and removing them; and replacing specially encoded digits and pictographic codes;
C) extracting mobile phone numbers and URL addresses, and extracting message-related behavioral features;
D) adding a base spam-degree attribute to each keyword, matching keywords against the content preprocessed in step B), and recording every matched keyword;
E) determining similar content, and calculating the short message spam degree based on similarity;
F) combining user creditworthiness and short message spam degree to decide whether to intercept.
The invention aims to score message content and user behavior together so that the two reinforce each other, and then to decide whether a message is spam in combination with the user's creditworthiness, so as to intercept as many spam messages as possible while reducing the impact of false positives on high-reputation users.
The invention assigns each user an initial creditworthiness according to the user's activity, then uses Hadoop to extract daily counts of the user's behavior in each service, and maintains the user's creditworthiness in real time.
Text preprocessing is then performed. Normal punctuation marks are first removed from the text; the interference characters configured in the system (such as ぁ) are identified, their count is recorded, and they are removed; and specially encoded digits and pictographic codes (such as 4. and 〇) are replaced.
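The preprocessing step can be illustrated with a minimal sketch. The punctuation set, the interference character table, and the digit mapping below are hypothetical stand-ins for the system configuration, which the description does not enumerate; mapping each special digit to its ordinary ASCII digit is likewise an assumption about what "replace" means here.

```python
import string

# Hypothetical configuration tables; the description only names "ぁ" as an
# example of an interference character and "〇" as an example of a
# pictographic code.
PUNCTUATION = set(string.punctuation) | set("，。！？、；：")
INTERFERENCE_CHARS = {"ぁ", "★"}
# Circled digits ① to ⑨ (U+2460 to U+2468) plus the ideographic zero.
DIGIT_MAP = {chr(0x2460 + i): str(i + 1) for i in range(9)}
DIGIT_MAP["〇"] = "0"


def preprocess(text):
    """Return (cleaned_text, interference_count) in the order the steps are listed."""
    # 1. Remove normal punctuation marks.
    no_punct = "".join(ch for ch in text if ch not in PUNCTUATION)
    # 2. Count the configured interference characters, then remove them.
    count = sum(1 for ch in no_punct if ch in INTERFERENCE_CHARS)
    cleaned = "".join(ch for ch in no_punct if ch not in INTERFERENCE_CHARS)
    # 3. Replace specially encoded digits and pictographic codes.
    normalized = "".join(DIGIT_MAP.get(ch, ch) for ch in cleaned)
    return normalized, count
```

The interference count is returned alongside the cleaned text so that later scoring steps can use it as a feature.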
Based on the content produced by the second step, mobile phone numbers and URL addresses are extracted, and it is determined whether each phone number appears in the original string. The sending user's own behavioral features are extracted, for example: login from a different location, newly registered user, high message delivery failure rate (extensible). Similar-content features are extracted, for example: sender region distribution, sender login IP distribution, recipient region distribution, sending frequency (extensible). The spam degree is calculated from the extracted features, and spam identification is performed.
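A minimal sketch of the extraction step follows. The regular expressions are illustrative assumptions (an 11-digit mainland China mobile number and a simple URL pattern), not patterns taken from the description. Because preprocessing strips punctuation, this sketch reads URLs from the original text, which is also an assumption.

```python
import re

# Hypothetical patterns: 11-digit mobile number starting with 1, and a URL
# that starts with a scheme or "www.".
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")
URL_RE = re.compile(r"(?:https?://|www\.)[^\s]+", re.IGNORECASE)


def extract_contacts(original, preprocessed):
    """Extract phone numbers and URLs, flagging numbers absent from the original string."""
    phones = [
        # A number found only after preprocessing was obfuscated in the original.
        {"number": number, "in_original": number in original}
        for number in PHONE_RE.findall(preprocessed)
    ]
    urls = URL_RE.findall(original)
    return {"phones": phones, "urls": urls}
```

A phone number with `in_original` set to `False` was disguised with interference characters or special digits, which a scoring step could treat as a stronger spam signal.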
A base spam-degree attribute is added to each keyword; keywords are matched against the preprocessed text, and every matched keyword is recorded. The spam degree is calculated from the matched keywords and added to the result calculated in the third step, and spam identification is performed.
Similar content is determined. The spam degree is calculated based on similarity and added to the result of the fourth step, and spam identification is performed.
User creditworthiness and short message spam degree are combined to decide whether to intercept. A message with a moderate spam degree that the user is permitted to send is delivered, and the user's creditworthiness is deducted at the same time.
Because it is based on user creditworthiness and short message spam degree, the invention can filter spam short messages more accurately and reduce misjudgments of spam short messages.
Description of drawings
Fig. 1 is a flow chart of spam short message filtering according to an embodiment of the present invention;
Fig. 2 is a flow chart of user creditworthiness maintenance according to an embodiment of the present invention;
Fig. 3 is a flow chart of an embodiment of the text preprocessing step shown in Fig. 1;
Fig. 4 is a flow chart of an embodiment of the behavioral feature processing step shown in Fig. 1;
Fig. 5 is a flow chart of an embodiment of the keyword matching step shown in Fig. 1;
Fig. 6 is a flow chart of an embodiment of the similarity determination step shown in Fig. 1;
Fig. 7 is a flow chart of an embodiment of the suspected-spam message processing step shown in Fig. 1.
Embodiment
The present invention is described in further detail below with reference to the drawings and specific embodiments.
Figs. 1 to 7 are flow charts of spam short message filtering according to an embodiment of the present invention. In this example, the spam filtering method of the invention is embodied in the feature processing step, the keyword processing step, and the similarity determination step, as well as in the normal message processing flow, the suspected-spam processing flow, and the spam message processing flow. The normal message, suspected-spam message, and spam message processing flows mainly supply the data that supports user creditworthiness maintenance.
In this example, the spam filtering method scores a message according to its text and features to decide whether it is spam. Behavioral feature processing, keyword matching, and similarity determination are applied in sequence, and combining the three methods improves the accuracy of spam judgment.
In this example, the spam filtering method is also combined with blacklist/whitelist filtering: a blacklisted user has a creditworthiness of 0 and is forbidden to send any message, while a whitelisted user has a creditworthiness of 1 and every message the user sends is treated as normal by default.
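The blacklist/whitelist rule amounts to a short-circuit check before any scoring. The sketch below assumes creditworthiness is stored as a number in the range 0 to 1; the function name and return values are hypothetical.

```python
def list_precheck(creditworthiness):
    """Return a final verdict for listed users, or None if scoring is still needed."""
    if creditworthiness <= 0.0:
        return "block"  # blacklisted: no message may be sent
    if creditworthiness >= 1.0:
        return "pass"   # whitelisted: messages are treated as normal by default
    return None         # ordinary user: continue with the scoring flows
```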
The five processing flows are described in detail below.
User creditworthiness maintenance flow: this flow comprises three parts: creditworthiness initialization, creditworthiness deduction for violations, and creditworthiness accumulation for active behavior. The violations that incur a deduction include submitting spam messages and sending suspected-spam messages, and they are deducted in real time; accumulation for active behavior is carried out by scheduled Hadoop analysis. The creditworthiness initialization rules are:
Figure BDA00002670938000041
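The three parts of the maintenance flow can be sketched as a small in-memory ledger. The initialization rules are given only in the figure above, so the initial value is simply passed in by the caller; the 0 to 1 range, the class name, and the per-action gain are assumptions made for illustration, and the batch method stands in for the scheduled Hadoop job.

```python
class CreditLedger:
    """Hypothetical in-memory store of user creditworthiness, clamped to [0, 1]."""

    def __init__(self):
        self._credit = {}

    def initialize(self, user, initial):
        # The initial value comes from rules not reproduced here.
        self._credit[user] = min(1.0, max(0.0, initial))

    def deduct(self, user, amount):
        # Real-time deduction when a violation is detected.
        self._credit[user] = max(0.0, self._credit[user] - amount)
        return self._credit[user]

    def accrue_daily(self, activity_counts, gain_per_action):
        # Batch accumulation from per-user daily activity counts.
        for user, count in activity_counts.items():
            if user in self._credit:
                gained = self._credit[user] + count * gain_per_action
                self._credit[user] = min(1.0, gained)

    def get(self, user):
        return self._credit[user]
```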
Behavioral feature processing flow: this flow mainly extracts the behavioral features related to a message. Commercial messages, for example, generally contain a mobile phone number or URL address, and their key content is interspersed with interference characters or uses specially encoded characters (such as 4./⒀). Spam messages are also typically sent in bulk, so it is also necessary to analyze the IP distribution, recipient region distribution, sender region distribution, and so on for similar content. This information is aggregated to calculate the message's spam degree, which is then used to decide whether the message is spam. During this intermediate stage only spam is identified, that is, the flow only checks whether the spam degree exceeds the predefined spam threshold. If the message is judged to be spam, creditworthiness points are deducted.
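As a rough sketch of this flow, the extracted features can be turned into a spam degree by a weighted sum and compared against the spam threshold. The feature names, weights, and threshold below are hypothetical; the description gives no numeric values and does not state that the aggregation is a weighted sum.

```python
# Hypothetical weights and threshold, for illustration only.
FEATURE_WEIGHTS = {
    "login_from_other_region": 15,
    "newly_registered": 10,
    "high_delivery_failure_rate": 20,
    "contains_phone_or_url": 10,
    "bulk_sending": 25,
}
PER_INTERFERENCE_CHAR = 5
SPAM_THRESHOLD = 80


def behavior_spam_degree(features, interference_count=0):
    """Return (spam_degree, is_spam) for a dict of boolean behavioral features."""
    degree = sum(w for name, w in FEATURE_WEIGHTS.items() if features.get(name))
    degree += PER_INTERFERENCE_CHAR * interference_count
    # At this intermediate stage only the spam threshold is checked.
    return degree, degree > SPAM_THRESHOLD
```

A message that does not cross the threshold here is not yet cleared; its spam degree carries forward into the keyword and similarity flows.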
Keyword matching flow: this flow first defines a spam-value attribute for common keywords, combination keywords, and sensitive keywords, then matches keywords against the preprocessed text, calculates the spam degree of the matched keywords, and adds it to the running total spam degree. Finally, it decides from the spam degree whether the message is spam. During this intermediate stage only spam is identified, that is, the flow only checks whether the spam degree exceeds the predefined spam threshold. Keyword matching may also apply regular-expression matching to the original content string.
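The keyword flow can be sketched as follows. The keyword table and its spam values are invented for illustration, and treating a combination keyword as "all parts must appear" is an assumption about how combination keywords are matched.

```python
# Hypothetical keyword table: each entry is (parts, spam_value). A single-part
# entry is a common or sensitive keyword; a multi-part entry is a combination
# keyword that matches only when every part appears in the text.
KEYWORDS = [
    (("invoice",), 30),
    (("loan", "no collateral"), 50),
    (("gambling",), 80),
]


def match_keywords(text, running_total=0):
    """Return (matched_entries, new_total) after adding matched spam values."""
    matched = [
        (parts, value)
        for parts, value in KEYWORDS
        if all(part in text for part in parts)
    ]
    total = running_total + sum(value for _, value in matched)
    return matched, total
```

The `running_total` argument carries the spam degree accumulated by the earlier behavioral feature flow, so the returned total reflects both stages.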
Similarity determination flow: this flow performs fingerprint similarity matching against previously intercepted spam messages, calculates the maximum similarity, accumulates the similarities that exceed a certain value, converts the extracted attributes into a spam degree (a Bayesian algorithm may also be used to classify the text), and adds it to the running total spam degree. If the total spam degree is below the suspected-spam threshold, the message is processed directly as a normal message; if it is above the spam threshold, the message is judged to be spam; otherwise, the suspected-spam processing flow is carried out.
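The description does not specify how the fingerprint is computed. The sketch below substitutes a common stand-in, character 3-gram sets compared by Jaccard similarity, to find the maximum similarity against intercepted spam, and then applies the three-way threshold decision. The thresholds are parameters because no values are given, and the accumulation of similarities and their conversion into a spam degree are omitted.

```python
def shingles(text, n=3):
    """Character n-gram set used here as a stand-in fingerprint."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def max_similarity(text, intercepted_history):
    """Highest similarity between the message and any previously intercepted spam."""
    fingerprint = shingles(text)
    return max(
        (jaccard(fingerprint, shingles(old)) for old in intercepted_history),
        default=0.0,
    )


def classify(total_spam_degree, suspected_threshold, spam_threshold):
    """Three-way decision on the accumulated spam degree."""
    if total_spam_degree < suspected_threshold:
        return "normal"
    if total_spam_degree > spam_threshold:
        return "spam"
    return "suspected"
```

A production system would more likely use a compact fingerprint such as a hash-based signature so that the history can be searched efficiently; the set-based version above is chosen only for readability.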
Suspected-spam processing flow: the judgment is based on user creditworthiness, and the processing is as follows:
Figure BDA00002670938000051
When a suspected-spam message is authorized for delivery, creditworthiness is deducted according to the user's creditworthiness and the suspected spam degree. The formula is given below, assuming that creditworthiness is divided into n tiers denoted C1, C2, ..., Cn, with C1 the highest; T1, T2, ..., Tn denote the number of messages allowed to be sent between adjacent creditworthiness tiers; B1, B2, ..., Bn denote the baseline spam contribution value of each tier; and G is the spam degree:
Creditworthiness deduction = (C1 - C2) / T1 * (G / B1).
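A worked instance of the deduction formula is sketched below. The formula is stated only for the top tier (C1, C2, T1, B1); the numeric values in the usage note are invented to show the arithmetic.

```python
def credit_deduction(c1, c2, t1, g, b1):
    """Deduction = (C1 - C2) / T1 * (G / B1), as stated for the top tier."""
    # (c1 - c2) / t1 is the credit spent per allowed message within the tier;
    # g / b1 scales it by how spam-like this message is relative to the baseline.
    return (c1 - c2) / t1 * (g / b1)
```

For example, with hypothetical values C1 = 1.0, C2 = 0.5, T1 = 10, G = 60 and B1 = 120, the deduction is (0.5 / 10) * (60 / 120) = 0.025, so a top-tier user could send about 20 such messages before dropping to the next tier.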
Because it is based on user creditworthiness and short message spam degree, the invention can filter spam short messages more accurately and reduce misjudgments of spam short messages.
The above is a further detailed description of the invention in connection with specific preferred embodiments, and the implementation of the invention should not be regarded as limited to these descriptions. A person of ordinary skill in the art to which the invention belongs may make a number of simple deductions or substitutions without departing from the inventive concept, and all of these shall be regarded as falling within the protection scope of the invention.

Claims (6)

1. A short message filtering method based on "user creditworthiness and short message spam degree", comprising the steps of:
A) assigning each user an initial creditworthiness according to the activity of the short message user;
B) text preprocessing: first removing normal punctuation marks from the text; identifying the interference characters configured in the system, recording their count, and removing them; and replacing specially encoded digits and pictographic codes;
C) extracting mobile phone numbers and URL addresses, and extracting message-related behavioral features;
D) adding a base spam-degree attribute to each keyword, matching keywords against the content preprocessed in step B), and recording every matched keyword;
E) determining similar content, and calculating the short message spam degree based on similarity;
F) combining user creditworthiness and short message spam degree to decide whether to intercept.
2. The short message filtering method based on "user creditworthiness and short message spam degree" according to claim 1, characterized in that: after the initial creditworthiness of the user is assigned in step A), Hadoop is used to extract daily counts of the user's behavior in each service, and the user's creditworthiness is maintained in real time.
3. The short message filtering method based on "user creditworthiness and short message spam degree" according to claim 1 or 2, characterized in that: the message-related behavioral features comprise the user's own behavioral features and similar-content features.
4. The short message filtering method based on "user creditworthiness and short message spam degree" according to claim 3, characterized in that: the user's own behavioral features comprise login from a different location, newly registered user, and message delivery failure rate.
5. The short message filtering method based on "user creditworthiness and short message spam degree" according to claim 3, characterized in that: the similar-content features comprise sender region distribution, sender login IP distribution, recipient region distribution, and sending frequency.
6. The short message filtering method based on "user creditworthiness and short message spam degree" according to claim 3, characterized in that: maintaining the user's creditworthiness in real time comprises three parts: creditworthiness initialization, creditworthiness deduction for violations, and creditworthiness accumulation for active behavior.
CN201210580601.5A 2012-12-28 2012-12-28 One kind is based on the short message filter method of " user's credit worthiness and short message spam degree " Active CN103037339B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210580601.5A CN103037339B (en) 2012-12-28 2012-12-28 One kind is based on the short message filter method of " user's credit worthiness and short message spam degree "

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210580601.5A CN103037339B (en) 2012-12-28 2012-12-28 One kind is based on the short message filter method of " user's credit worthiness and short message spam degree "

Publications (2)

Publication Number Publication Date
CN103037339A true CN103037339A (en) 2013-04-10
CN103037339B CN103037339B (en) 2017-11-17

Family

ID=48023735

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210580601.5A Active CN103037339B (en) 2012-12-28 2012-12-28 One kind is based on the short message filter method of " user's credit worthiness and short message spam degree "

Country Status (1)

Country Link
CN (1) CN103037339B (en)

Also Published As

Publication number Publication date
CN103037339B (en) 2017-11-17

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 4, 01-11 building, 518000 / F, Changhong technology building, 18 South twelve Road, Nanshan District, Guangdong, Shenzhen

Applicant after: Polytron Technologies Inc

Address before: 4, 01-11 building, 518000 / F, Changhong technology building, 18 South twelve Road, Nanshan District, Guangdong, Shenzhen

Applicant before: Shenzhen City Richinfo Technology Co., Ltd.

COR Change of bibliographic data
GR01 Patent grant
CP02 Change in the address of a patent holder

Address after: 518000 31 / F, Caixun science and technology building, No. 3176, Keyuan South Road, community, high tech Zone, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province

Patentee after: RICHINFO TECHNOLOGY Co.,Ltd.

Address before: 4, 01-11 building, 518000 / F, Changhong technology building, 18 South twelve Road, Nanshan District, Guangdong, Shenzhen

Patentee before: RICHINFO TECHNOLOGY Co.,Ltd.