CN103037339B - Short message filtering method based on "user reputation and short message spam degree" - Google Patents

Short message filtering method based on "user reputation and short message spam degree"

Info

Publication number
CN103037339B
Authority
CN
China
Prior art keywords
user
credit worthiness
short message
short
degree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201210580601.5A
Other languages
Chinese (zh)
Other versions
CN103037339A (en)
Inventor
杨东洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
POLYTRON TECHNOLOGIES Inc
Original Assignee
POLYTRON TECHNOLOGIES Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2012-12-28
Filing date
2012-12-28
Publication date
2017-11-17
Application filed by POLYTRON TECHNOLOGIES Inc
Priority to CN201210580601.5A
Publication of CN103037339A
Application granted
Publication of CN103037339B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a short message filtering method based on "user reputation and short message spam degree", comprising the steps of: A) assigning each user an initial reputation according to how active the user is on the short message service; B) text preprocessing: first removing ordinary punctuation from the text, then identifying the interference characters configured in the system, recording their count and removing them, and replacing specially encoded digits and pictographic characters; C) extracting mobile phone numbers and URLs, and extracting the behavioral features associated with the short message; D) adding a base spam-degree attribute to each keyword, matching keywords against the content preprocessed in step B), and recording every keyword matched; E) similar-content determination: computing the spam degree of the short message from similarity; F) deciding whether to intercept the message by combining the user's reputation with the message's spam degree. By relying on both user reputation and message spam degree, the invention filters spam short messages more accurately and reduces misjudgments.

Description

Short message filtering method based on "user reputation and short message spam degree"
Technical field
The invention belongs to short message processing within the field of Internet communication technology. Specifically, it relates to a method, based on "user reputation and short message spam degree", by which the short message platform of an Internet communication system supervises and filters the publicly distributed content that users submit.
Background
In recent years, with the rapid growth of mailbox services, offenders who specialize in sending spam short messages have used the free short message channels attached to certain mailboxes (such as the 139 mailbox) as a tool for illicit profit or for concealed purposes. As one of the value-added services of mobile communication, short messaging gives people a cheap and convenient way to communicate, but it has also bred a large volume of spam messages that spread harmful content such as pornography, commercial fraud and commercial advertising. These spam messages seriously disturb people's lives and endanger public safety, and the problem of supervising them has drawn wide attention from all sectors of society. Besides strengthening the supervision of published information through legislation, it is even more important to explore effective technical safeguards for spam short message filtering.
In the prior art there are two main ways to filter spam short messages: keyword-based filtering and content-based filtering.
In keyword-based filtering, the system defines a set of keywords in advance, and any message whose content contains one of them is classified as spam and intercepted. Because the decision rests on a single criterion, this method produces a large number of misjudgments.
Content-based filtering uses machine learning to divide messages into normal messages and spam. The machine learning methods currently used for short message classification mainly include Bayesian classifiers, SVM, KNN and artificial neural networks. This approach also suffers from misjudgments.
Summary of the invention
The object of the invention is to provide a short message filtering method, based on "user reputation and short message spam degree", for supervising and filtering the publicly distributed content that users submit.
To achieve this object, the short message filtering method of the invention based on "user reputation and short message spam degree" comprises the steps of:
A) assigning each user an initial reputation according to how active the user is on the short message service;
B) text preprocessing: first removing ordinary punctuation from the text, then identifying the interference characters configured in the system, recording their count and removing them, and replacing specially encoded digits and pictographic characters;
C) extracting mobile phone numbers and URLs, and extracting the behavioral features associated with the short message;
D) adding a base spam-degree attribute to each keyword, matching keywords against the content preprocessed in step B), and recording every keyword matched;
E) similar-content determination: computing the spam degree of the short message from similarity;
F) deciding whether to intercept the message by combining the user's reputation with the message's spam degree.
The invention scores short message content and user behavior together so that the two reinforce each other, then combines the result with the user's reputation to decide whether a message is spam. The aim is to catch as much spam as possible while reducing the impact of false interceptions on high-reputation users.
The invention assigns each user an initial reputation according to how active the user is, then uses Hadoop to extract daily statistics on the user's behavior in each service, and maintains the user's reputation in real time.
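The maintenance of user reputation (translated above as "credit worthiness") can be illustrated with a minimal sketch. The class name `Reputation`, its methods, and the clamping of the score to the range 0 to 1 are assumptions made for illustration; the text only states that reputation is initialized, deducted for violations and increased for active behavior.

```python
from dataclasses import dataclass


@dataclass
class Reputation:
    """Hypothetical per-user reputation score, kept within [0.0, 1.0]."""
    score: float

    def deduct(self, amount: float) -> None:
        # Real-time deduction after a violation; the score never drops below 0.
        self.score = max(0.0, self.score - amount)

    def reward(self, amount: float) -> None:
        # Periodic increase for active use (for example after a daily batch job);
        # the score never exceeds 1.
        self.score = min(1.0, self.score + amount)
```

The initial value is passed in by the caller, because the initialization rule itself is not given in the text.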
Text preprocessing follows. Ordinary punctuation is first removed from the text; the noise characters configured in the system (such as ぁ) are identified, counted and removed; and specially encoded digits and pictographic characters (such as 4., 〇) are replaced.
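The preprocessing step can be sketched as a single pass over the characters of the message. The contents of `NOISE_CHARS` and `DIGIT_MAP` below are hypothetical configuration; the text names only ぁ and 〇 as examples, and the use of Unicode punctuation categories to define "ordinary punctuation" is likewise an assumption.

```python
import unicodedata
from typing import Tuple

# Hypothetical configuration: the text only gives "ぁ" as a noise example.
NOISE_CHARS = {"ぁ"}
# Hypothetical map from specially encoded digits to ASCII digits.
DIGIT_MAP = {"〇": "0", "①": "1", "②": "2", "③": "3", "④": "4",
             "⑤": "5", "⑥": "6", "⑦": "7", "⑧": "8", "⑨": "9"}


def preprocess(text: str) -> Tuple[str, int]:
    """Return (cleaned_text, noise_count) for one short message."""
    out = []
    noise_count = 0
    for ch in text:
        if ch in NOISE_CHARS:
            noise_count += 1            # record the noise character, then drop it
        elif ch in DIGIT_MAP:
            out.append(DIGIT_MAP[ch])   # replace the encoded digit
        elif unicodedata.category(ch).startswith("P"):
            continue                    # drop ordinary punctuation
        else:
            out.append(ch)
    return "".join(out), noise_count
```

For example, `preprocess("1ぁ3ぁ8-0〇①!")` yields the cleaned string `"138001"` together with a noise count of 2, which a later step could treat as one more spam signal.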
Based on the content produced by the second step, mobile phone numbers and URLs are extracted, and each phone number is checked to see whether it is part of the original string content. Features of the sender's own behavior are extracted, for example login from a different region, newly registered user, and a high short message delivery failure rate (the list is extensible). Similar-content features are extracted, for example the regional distribution of senders, the distribution of sender login IPs, the regional distribution of recipients, and sending frequency (the list is extensible). A spam degree is computed from the extracted features and used to identify spam short messages.
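Phone number and URL extraction can be sketched with regular expressions. Both patterns are assumptions, since the text does not specify them. The sketch also makes two interpretive choices: phone numbers are searched in the preprocessed text while URLs are searched in the raw text (punctuation removal would otherwise destroy them), and a number counts as part of the original content only if it appears unchanged in the raw message.

```python
import re

# Assumed patterns; the text does not define what counts as a phone number or URL.
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")  # 11-digit mainland-China mobile number
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def extract_contacts(raw: str, cleaned: str) -> dict:
    """Extract phone numbers and URLs from one message (hypothetical helper)."""
    phones = PHONE_RE.findall(cleaned)
    return {
        "phones": phones,
        # False means the number only became visible after preprocessing,
        # i.e. it was obfuscated in the raw message.
        "phone_in_raw": {p: p in raw for p in phones},
        "urls": URL_RE.findall(raw),
    }
```

A number that is visible only after preprocessing was deliberately obscured by the sender, which is itself a useful behavioral feature.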
A base spam-degree attribute is added to each keyword. Keywords are matched against the preprocessed content, and every matched keyword is recorded. A spam degree is computed from the matched keywords and combined with the result of the third step, and spam short message identification is performed.
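Keyword matching with a per-keyword spam value can be sketched as follows. The keyword table `KEYWORD_SPAM`, its values and the simple additive scoring are hypothetical; the text states only that each keyword carries a base spam attribute and that the result is accumulated with the earlier score.

```python
from typing import List, Tuple

# Hypothetical keyword table: keyword -> base spam value.
KEYWORD_SPAM = {"invoice": 30, "winner": 25, "loan": 20}


def keyword_spam_degree(cleaned: str, running_total: float = 0.0) -> Tuple[List[str], float]:
    """Return the matched keywords and the updated total spam degree."""
    lowered = cleaned.lower()
    matched = [kw for kw in KEYWORD_SPAM if kw in lowered]
    score = sum(KEYWORD_SPAM[kw] for kw in matched)
    # Add the keyword score to the spam degree accumulated by earlier steps.
    return matched, running_total + score
```

Returning the matched keywords alongside the score keeps a record of why a message was flagged, as the step requires.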
Similar-content determination. A spam degree is computed from similarity and combined with the result of the fourth step, and spam short message identification is performed.
Whether to intercept is decided by combining the user's reputation with the message's spam degree. When the spam degree is moderate and the user is allowed to send the message, the message is sent and the user's reputation is deducted at the same time.
By relying on both user reputation and message spam degree, the invention filters spam short messages more accurately and reduces misjudgments.
Brief description of the drawings
Fig. 1 is a flow chart of an embodiment of the invention for filtering spam short messages;
Fig. 2 is a flow chart of an embodiment of the invention for maintaining user reputation;
Fig. 3 is a flow chart of an embodiment of the text preprocessing step shown in Fig. 1;
Fig. 4 is a flow chart of an embodiment of the behavioral feature processing step shown in Fig. 1;
Fig. 5 is a flow chart of an embodiment of the keyword matching step shown in Fig. 1;
Fig. 6 is a flow chart of an embodiment of the similarity determination step shown in Fig. 1;
Fig. 7 is a flow chart of an embodiment of the suspected spam short message processing step shown in Fig. 1.
Detailed description of the embodiments
The invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Figs. 1 to 7 are flow charts of an embodiment of the invention for filtering spam short messages. In this example, the spam filtering method of the invention is embodied in the behavioral feature processing step, the keyword processing step and the similarity determination step, as well as in the normal message flow, the suspected-spam flow and the spam flow. The normal message flow, the suspected-spam flow and the spam flow mainly supply the data on which user reputation maintenance depends.
In this example, the spam filtering method of the invention decides whether a message is spam by scoring its text and features. Behavioral feature processing, keyword matching and similarity determination are applied in turn, and combining the three improves the accuracy of spam identification.
In this example the spam filtering method is also combined with black/white list filtering: a blacklisted user has a reputation of 0 and is forbidden to send any short message, while a whitelisted user has a reputation of 1 and every message the user sends is treated as normal by default.
The five processing flows are described in detail below.
User reputation maintenance flow: this flow has three parts, namely reputation initialization, reputation deduction for violations, and reputation accumulation for active behavior. Violations that trigger a deduction include submitting spam short messages and sending suspected spam short messages, and the deduction is applied in real time. Reputation accumulation for active behavior is carried out by scheduled Hadoop analysis. Reputation initialization rule:
Behavioral feature processing flow: this flow mainly extracts the behavioral features associated with a short message. Commercial spam, for example, typically embeds a mobile phone number or URL and obscures key content with interference characters or specially encoded characters (such as 4./ (13)). Spam is also sent in bulk, so similar content must additionally be analyzed for IP distribution, recipient regional distribution, sender regional distribution and so on. This information is aggregated to compute the message's spam degree, which then determines whether the message is spam. At this intermediate stage the only question is whether the message is spam, that is, whether its spam degree exceeds the predetermined spam threshold. If the message is judged to be spam, the sender's reputation is deducted.
Keyword matching flow: a spam value attribute is first defined for general keywords, combination keywords and sensitive keywords. The flow then matches keywords against the preprocessed text, computes a spam degree for the matched keywords, and adds it to the total spam degree accumulated so far. Finally the total is used to decide whether the message is spam. At this intermediate stage the only question is whether the message is spam, that is, whether the spam degree exceeds the predetermined spam threshold. During keyword matching, regular-expression matching may also be run against the original content string.
Similarity determination flow: this flow matches the message's fingerprint against the history of intercepted spam messages and computes the maximum similarity. Similarities above a certain value are accumulated, the extracted attributes are converted into a spam degree (a Bayesian algorithm may also be used to classify the text), and the result is added to the total spam degree accumulated so far. If the total spam degree is below the suspected-spam threshold, the message is handled directly as a normal message; if it is above the spam threshold, the message is judged to be spam; otherwise the suspected-spam flow is executed.
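The similarity match and the final three-way decision can be sketched as follows. The text does not say how fingerprints are built, so character 3-gram sets compared by Jaccard similarity are used here purely as a stand-in; the function names and threshold parameters are also hypothetical.

```python
from typing import Iterable, Set


def shingles(text: str, k: int = 3) -> Set[str]:
    # Character k-grams serve as a simple stand-in for a message fingerprint.
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def jaccard(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)


def max_similarity(message: str, spam_history: Iterable[str]) -> float:
    # Highest similarity between the message and any previously intercepted spam.
    return max((jaccard(message, h) for h in spam_history), default=0.0)


def classify(total_spam: float, suspected_threshold: float, spam_threshold: float) -> str:
    if total_spam < suspected_threshold:
        return "normal"      # handled as a normal message
    if total_spam > spam_threshold:
        return "spam"        # intercepted
    return "suspected"       # handed to the suspected-spam flow
```

A production system would more likely use a compact fingerprint such as a hash of the shingle set so that the spam history can be searched quickly, but the decision logic stays the same.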
Suspected-spam flow: the decision is based on the user's reputation, and processing proceeds as follows:
When a suspected spam message is authorized for sending, a reputation deduction is made according to the user's reputation and the suspected spam degree. The calculation is as follows (assume reputation is divided into n tiers, denoted C1, C2, ..., Cn, with C1 the highest; T1, T2, ..., Tn denote the number of messages allowed between adjacent reputation tiers; B1, B2, ..., Bn are the base spam contribution values of each class; G is the spam degree):
Reputation deduction = (C1 - C2) / T1 × (G / B1).
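The deduction formula can be checked with a small worked example. The function below is a direct transcription of (C1 - C2) / T1 × (G / B1); the parameter names and the sample numbers are illustrative assumptions, and applying the same formula to lower tiers is a natural reading rather than something the text states.

```python
def reputation_deduction(c_upper: float, c_lower: float, allowed_msgs: int,
                         spam_degree: float, base_contribution: float) -> float:
    """Deduction = (C1 - C2) / T1 * (G / B1)."""
    if allowed_msgs <= 0 or base_contribution <= 0:
        raise ValueError("allowed_msgs and base_contribution must be positive")
    # (C1 - C2) / T1 is the reputation cost of one baseline message within the tier;
    # G / B1 scales that cost by how spam-like this particular message is.
    return (c_upper - c_lower) / allowed_msgs * (spam_degree / base_contribution)
```

With C1 = 1.0, C2 = 0.8, T1 = 10, G = 60 and B1 = 30, the deduction is 0.2 / 10 × 2 = 0.04, so five such messages would move the user down one full tier.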
By relying on both user reputation and message spam degree, the invention filters spam short messages more accurately and reduces misjudgments.
The above is a further detailed description of the invention in connection with specific preferred embodiments, and the specific implementation of the invention should not be regarded as limited to these descriptions. A person of ordinary skill in the art to which the invention belongs may make simple deductions or substitutions without departing from the inventive concept, and all of these shall be regarded as falling within the scope of protection of the invention.

Claims (1)

1. A short message filtering method based on "user reputation and short message spam degree", comprising the steps of:
A) assigning each user an initial reputation according to how active the user is on the short message service; after the initial reputation is assigned, using Hadoop to extract daily statistics on the user's behavior in each service, and maintaining the user's reputation in real time; the real-time maintenance of user reputation comprises three parts: reputation initialization, reputation deduction for violations, and reputation accumulation for active behavior;
B) text preprocessing: first removing ordinary punctuation from the text, then identifying the interference characters configured in the system, recording their count and removing them, and replacing specially encoded digits and pictographic characters;
C) extracting mobile phone numbers and URLs, and extracting the behavioral features associated with the short message; the behavioral features associated with the short message comprise the user's own behavioral features and similar-content features;
D) adding a base spam-degree attribute to each keyword, matching keywords against the content preprocessed in step B), and recording every keyword matched;
E) similar-content determination: computing the spam degree of the short message from similarity;
F) deciding whether to intercept the message by combining the user's reputation with the message's spam degree;
wherein the user's own behavioral features comprise login from a different region, newly registered user, and short message delivery failure rate; and the similar-content features comprise the regional distribution of senders, the distribution of sender login IPs, the regional distribution of recipients, and sending frequency.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210580601.5A 2012-12-28 2012-12-28 Short message filtering method based on "user reputation and short message spam degree"

Publications (2)

Publication Number Publication Date
CN103037339A 2013-04-10
CN103037339B 2017-11-17

Family

ID=48023735

Country Status (1)

Country Link
CN (1) CN103037339B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101150756A (en) * 2007-11-08 2008-03-26 电子科技大学 A spam filtering method
CN101459718A (en) * 2009-01-06 2009-06-17 华中科技大学 Rubbish voice filtering method based on mobile communication network and system thereof
CN101715192A (en) * 2009-11-12 2010-05-26 成都市华为赛门铁克科技有限公司 Harassing call filtering method, device and system
CN102045652A (en) * 2009-10-21 2011-05-04 深圳市彩讯科技有限公司 Garbage short message interception method based on characteristic similarity

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 4, 01-11 building, 518000 / F, Changhong technology building, 18 South twelve Road, Nanshan District, Guangdong, Shenzhen

Applicant after: Polytron Technologies Inc

Address before: 4, 01-11 building, 518000 / F, Changhong technology building, 18 South twelve Road, Nanshan District, Guangdong, Shenzhen

Applicant before: Shenzhen City Richinfo Technology Co., Ltd.

COR Change of bibliographic data
GR01 Patent grant
GR01 Patent grant
CP02 Change in the address of a patent holder
CP02 Change in the address of a patent holder

Address after: 518000 31 / F, Caixun science and technology building, No. 3176, Keyuan South Road, community, high tech Zone, Yuehai street, Nanshan District, Shenzhen City, Guangdong Province

Patentee after: RICHINFO TECHNOLOGY Co.,Ltd.

Address before: 4, 01-11 building, 518000 / F, Changhong technology building, 18 South twelve Road, Nanshan District, Guangdong, Shenzhen

Patentee before: RICHINFO TECHNOLOGY Co.,Ltd.