CN103037339B

CN103037339B - A short message filtering method based on "user credit worthiness and short message spam degree"

Info

Publication number: CN103037339B
Application number: CN201210580601.5A
Authority: CN
Inventors: 杨东洋
Original assignee: POLYTRON TECHNOLOGIES Inc
Current assignee: POLYTRON TECHNOLOGIES Inc
Priority date: 2012-12-28
Filing date: 2012-12-28
Publication date: 2017-11-17
Anticipated expiration: 2032-12-28
Also published as: CN103037339A

Abstract

The invention discloses a short message filtering method based on "user credit worthiness and short message spam degree", comprising the steps of: A) assigning each user an initial credit worthiness according to the user's short message activity; B) text preprocessing: first removing normal punctuation marks from the text, then identifying the interference characters configured in the system, recording their count and removing them, and replacing specially encoded digits and pictographic codes; C) extracting mobile phone numbers and URL addresses, and extracting the behavior features related to the short message; D) adding a base spam degree attribute to each keyword, performing keyword matching on the content preprocessed in step B), and recording each matched keyword; E) similar content determination: calculating the short message spam degree based on similarity; F) combining the user's credit worthiness with the short message spam degree to decide whether to intercept. By relying on both user credit worthiness and short message spam degree, the invention filters spam short messages more accurately and reduces the misjudgment of spam short messages.

Description

A short message filtering method based on "user credit worthiness and short message spam degree"

Technical field

The invention belongs to short message processing technology in the field of Internet communication. More specifically, it relates to a method by which the short message platform of an Internet communication system supervises and filters the content that users submit for public dissemination, based on "user credit worthiness and short message spam degree".

Background technology

In recent years, with the rapid development of mailbox services, some criminals who specialize in sending spam short messages have used the free short message channels offered by certain mailboxes (such as the 139 mailbox) as a tool for illicit profit or for other concealed purposes. As one of the value-added services of mobile communication, short messaging gives people a cheap and convenient means of communication, but it has also bred a large volume of spam short messages that spread harmful information such as obscene or pornographic content, commercial fraud and commercial advertising. These spam short messages seriously disturb people's lives and endanger social security, and the problem of supervising them has received wide attention from all sectors of society. Besides strengthening the supervision of published information through legislation, it is even more important to explore effective technical measures for filtering spam short messages.

In the prior art, there are two main methods for filtering spam short messages: keyword-based filtering and content-based filtering.

In keyword-based spam short message filtering, the system configures a set of keywords in advance, and any short message whose content contains these keywords is deemed spam and intercepted. This method relies on a single basis for judgment and therefore suffers from a large number of misjudgments.

Content-based spam short message filtering uses machine learning to classify short messages as normal or spam. The machine learning methods currently used for short message classification mainly include Bayes, SVM, KNN and artificial neural networks. This filtering method also suffers from misjudgment.

Summary of the invention

The object of the invention is to provide a short message filtering method based on "user credit worthiness and short message spam degree" that supervises and filters the content users submit for public dissemination.

To achieve the above object, the short message filtering method of the invention based on "user credit worthiness and short message spam degree" comprises the following steps:

A) Assign each user an initial credit worthiness according to the user's short message activity;

B) Text preprocessing: first remove normal punctuation marks from the text, then identify the interference characters configured in the system, record their count and remove them, and replace specially encoded digits and pictographic codes;

C) Extract mobile phone numbers and URL addresses, and extract the behavior features related to the short message;

D) Add a base spam degree attribute to each keyword, perform keyword matching on the content preprocessed in step B), and record each matched keyword;

E) Similar content determination: calculate the short message spam degree based on similarity;

F) Combine the user's credit worthiness with the short message spam degree to decide whether to intercept.

The invention scores short message content and user behavior together so that the two reinforce each other, and then combines the result with user credit worthiness to decide whether a message is spam. The aim is to capture as many spam short messages as possible while reducing the impact of false interceptions on high-credit users.

According to each user's activity, the invention assigns each user an initial credit worthiness, then uses hadoop to extract daily statistics on the user's behavior in each service, and maintains the user's credit worthiness in real time.

Text preprocessing is then carried out. Normal punctuation marks in the text are removed first; the noise characters configured in the system (such as ぁ) are identified, counted and removed; and specially encoded digits and pictographic codes (such as 4., 〇) are replaced.
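The preprocessing step can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the character sets `PUNCTUATION`, `NOISE_CHARS` and `CODED_DIGITS` and the function name `preprocess` are assumptions, since the text names only ぁ and 〇 as examples.

```python
import string

# Illustrative character sets only; the real system-configured sets are not given.
PUNCTUATION = set(string.punctuation) | set("，。！？；：、（）")
NOISE_CHARS = {"ぁ", "ヽ", "★"}
# Assumed mapping from specially encoded digits to plain digits.
CODED_DIGITS = {"①": "1", "②": "2", "③": "3", "④": "4", "⑤": "5",
                "⑥": "6", "⑦": "7", "⑧": "8", "⑨": "9", "〇": "0"}


def preprocess(text):
    """Return (cleaned_text, noise_count) for one short message."""
    cleaned = []
    noise_count = 0
    for ch in text:
        if ch in PUNCTUATION:
            continue                  # drop normal punctuation
        if ch in NOISE_CHARS:
            noise_count += 1          # record how many noise characters were seen
            continue                  # ...and drop them
        cleaned.append(CODED_DIGITS.get(ch, ch))  # normalize coded digits
    return "".join(cleaned), noise_count


cleaned, noise = preprocess("加ぁ微ぁ信，1③8〇")
```

Keeping the noise count alongside the cleaned text lets later steps treat heavy obfuscation as a signal in its own right.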

Based on the content produced by the second step, mobile phone numbers and URL addresses are extracted, and it is determined whether each phone number appears in the original string content. Behavior features of the sending user are extracted, for example: login from a non-home location, newly registered user, high short message delivery failure rate, and so on (extensible). Similar-content features are extracted, for example: sender region distribution, sender login IP distribution, recipient region distribution, sending frequency, and so on (extensible). The spam degree is calculated from the extracted features, and spam short message identification is performed.
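A rough Python sketch of the phone number and URL extraction is given below. The regular expressions, the 11-digit mobile number format and the function name `extract_features` are assumptions made for illustration. URLs are matched here against the original text, on the assumption that punctuation removal would otherwise strip the dots and slashes that make a URL recognizable.

```python
import re

# Assumed 11-digit mobile number format, not embedded in a longer digit run.
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")
# Deliberately simple URL pattern; a production system would need a stricter one.
URL_RE = re.compile(r"(?:https?://|www\.)[A-Za-z0-9._~/?#=&%-]+", re.IGNORECASE)


def extract_features(original, cleaned):
    """Extract phone numbers and URLs from one short message."""
    phones = PHONE_RE.findall(cleaned)
    return {
        "phones": phones,
        # A number that only appears after preprocessing was disguised
        # in the original string, e.g. with noise characters.
        "obfuscated_phones": [p for p in phones if p not in original],
        "urls": URL_RE.findall(original),
    }


features = extract_features(
    "联系1③8ぁ0013ぁ8000 或 www.example.com/a",
    "联系13800138000或wwwexamplecoma",
)
```

A phone number that is absent from the original string but present after preprocessing is a hint that the sender tried to evade filtering, which is one plausible reason for the check against the original content.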

A base spam degree attribute is added to each keyword, keyword matching is performed on the preprocessed content, and each matched keyword is recorded. The spam degree is calculated from the matched keywords and combined with the result of the third step, and spam short message identification is performed.

Similar content determination. The spam degree is calculated based on similarity and combined with the result of the fourth step, and spam short message identification is performed.

The user's credit worthiness and the short message spam degree are combined to decide whether to intercept. When a short message has a moderate spam degree and the user is allowed to send it, the user's credit worthiness is deducted at the same time.

By relying on user credit worthiness and short message spam degree, the invention filters spam short messages more accurately and reduces the misjudgment of spam short messages.

Brief description of the drawings

Fig. 1 is a flow chart of an embodiment of the invention for filtering spam short messages;

Fig. 2 is a flow chart of an embodiment of the invention for maintaining user credit worthiness;

Fig. 3 is a flow chart of an embodiment of the text preprocessing step shown in Fig. 1;

Fig. 4 is a flow chart of an embodiment of the behavior feature processing step shown in Fig. 1;

Fig. 5 is a flow chart of an embodiment of the keyword matching step shown in Fig. 1;

Fig. 6 is a flow chart of an embodiment of the similarity determination step shown in Fig. 1;

Fig. 7 is a flow chart of an embodiment of the suspected spam short message processing step shown in Fig. 1.

Embodiment

The invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

Fig. 1 to Fig. 7 are flow charts of an embodiment of the invention for filtering spam short messages. In this example, the spam filtering method of the invention is embodied in the behavior feature processing step, the keyword processing step and the similarity determination step, as well as in the normal short message processing flow, the suspected spam processing flow and the spam short message processing flow. The normal short message, suspected spam short message and spam short message processing flows mainly provide the data that supports the maintenance of user credit worthiness.

In this example, the spam filtering method of the invention scores the text content and features of a short message to determine whether it is spam. Behavior feature processing, keyword matching and similarity determination are applied in sequence, and combining the three methods improves the accuracy of spam short message judgment.

Meanwhile in this example, rubbish filtering method of the present invention is also in relation with black/white list filter method, i.e. blacklist User's credit worthiness forbids sending any short message for 0, and white list user credit worthiness is that the short message that 1 acquiescence is sent is normal.

The five processing flows are described in detail below.

User credit worthiness maintenance flow: this flow comprises three parts: credit worthiness initialization, deduction of credit worthiness for violations, and accumulation of credit worthiness for active behavior. The violations that incur a deduction include submitting spam short messages and sending suspected spam short messages, and the deduction is applied in real time. The accumulation of credit worthiness for active behavior is carried out by periodic hadoop analysis. Credit worthiness initialization rule:
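The initialization rule itself does not appear in this text, so the following Python sketch shows only how the three parts of the maintenance flow could fit together. The class name `CreditLedger`, its method names and the [0, 1] score range (inferred from the blacklist value 0 and whitelist value 1) are assumptions; the periodic batch result is represented by a plain dictionary rather than an actual hadoop job.

```python
class CreditLedger:
    """Tracks per-user credit worthiness, clamped to an assumed [0, 1] range."""

    def __init__(self):
        self.scores = {}

    def initialize(self, user, initial):
        # Part 1: initialization. How `initial` is chosen is not specified here.
        self.scores[user] = self._clamp(initial)

    def deduct(self, user, amount):
        # Part 2: real-time deduction for spam or suspected spam messages.
        self.scores[user] = self._clamp(self.scores[user] - amount)

    def apply_activity_batch(self, increments):
        # Part 3: periodic accumulation from an offline analysis result,
        # given as {user: bonus}. Unknown users are ignored.
        for user, bonus in increments.items():
            if user in self.scores:
                self.scores[user] = self._clamp(self.scores[user] + bonus)

    @staticmethod
    def _clamp(value):
        return max(0.0, min(1.0, value))


ledger = CreditLedger()
ledger.initialize("u1", 0.5)
ledger.deduct("u1", 0.25)
ledger.apply_activity_batch({"u1": 0.125})
```

Separating the real-time deduction from the batch accumulation mirrors the split described above: violations take effect immediately, while rewards for activity arrive only when the periodic analysis runs.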

Behavior feature processing flow: this flow mainly extracts the behavior features related to a short message. For example, commercial short messages generally contain a mobile phone number or URL address, have interference characters mixed into the key content, or use specially encoded characters (such as 4./ (13)). Spam short messages are also typically sent in bulk, so it is also necessary to analyze, for similar content, the IP distribution, recipient region distribution, sender region distribution and so on. The above information is aggregated to calculate the spam degree of the short message, which then determines whether it is spam. During this intermediate processing, the only check is whether the message is spam, that is, whether the spam degree exceeds the predetermined spam short message threshold. If the message is judged to be spam, credit worthiness is deducted.

Keyword matching processing flow: general keywords, combination keywords and sensitive keywords are first each given a spam value attribute. The flow then performs keyword matching on the preprocessed text, calculates the spam degree of the matched keywords, and adds it to the total spam degree accumulated so far. Finally, the spam degree is used to determine whether the message is spam. During this intermediate processing, the only check is whether the message is spam, that is, whether the spam degree exceeds the predetermined spam short message threshold. During keyword matching, regular expression matching may also be performed on the original content string.
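The keyword matching flow might look like the following Python sketch. The keyword tables, their spam values, the threshold and the function name `match_keywords` are all hypothetical; summing the values of matched keywords is one simple way to "calculate the spam degree" and is an assumption, since no formula is given.

```python
# Hypothetical keyword tables: keyword (or keyword combination) -> spam value.
GENERAL = {"invoice": 30, "prize": 40}
SENSITIVE = {"gambling": 80}
COMBINATIONS = {("free", "claim"): 50}  # counts only if every part appears

SPAM_THRESHOLD = 100  # assumed predetermined spam short message threshold


def match_keywords(cleaned_text, running_total=0):
    """Return (total_spam_degree, matched_keywords, is_spam)."""
    matched = []
    total = running_total  # carry over the spam degree from earlier steps
    for table in (GENERAL, SENSITIVE):
        for word, value in table.items():
            if word in cleaned_text:
                matched.append(word)
                total += value
    for parts, value in COMBINATIONS.items():
        if all(part in cleaned_text for part in parts):
            matched.append("+".join(parts))
            total += value
    return total, matched, total > SPAM_THRESHOLD


total, matched, is_spam = match_keywords("free prize, claim now", running_total=20)
```

The `running_total` parameter reflects the accumulation described above, where each flow adds its contribution to the spam degree produced by the previous one.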

Similarity determination processing flow: this flow matches the message fingerprint against the history of intercepted spam short messages and calculates the maximum similarity. Similarities exceeding a certain value are accumulated, and the extracted related attributes are converted into a spam degree (the text may also be classified with a Bayesian algorithm), which is added to the total spam degree accumulated so far. If the total spam degree is below the suspected spam threshold, the message is handled directly as a normal short message; if it is above the spam short message threshold, the message is judged to be spam; otherwise, the suspected spam processing flow is executed.
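The fingerprinting scheme is not specified, so the Python sketch below substitutes a common stand-in: character n-gram sets compared by Jaccard similarity. The function names, the similarity floor of 0.6, the scale of 100 and the threshold values used in the example are all assumptions. The final three-way decision follows the thresholds described above.

```python
def shingles(text, n=3):
    """Character n-grams used here as a simple stand-in for a fingerprint."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def max_similarity(text, spam_history):
    """Highest similarity between `text` and any previously intercepted message."""
    fingerprint = shingles(text)
    return max((jaccard(fingerprint, shingles(h)) for h in spam_history), default=0.0)


def similarity_spam_degree(similarity, floor=0.6, scale=100):
    """Hypothetical conversion: only similarities above `floor` contribute."""
    return similarity * scale if similarity > floor else 0


def classify(total_spam_degree, suspected_threshold, spam_threshold):
    if total_spam_degree < suspected_threshold:
        return "normal"
    if total_spam_degree > spam_threshold:
        return "spam"
    return "suspected"   # handed to the suspected spam processing flow


sim = max_similarity("win a free prize now", ["win a free prize today", "hello"])
verdict = classify(20 + similarity_spam_degree(sim), suspected_threshold=30, spam_threshold=80)
```

A production system would more likely use a locality-sensitive fingerprint so that the history does not have to be scanned linearly; the linear scan here is only for readability.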

Suspected spam processing flow: a judgment is made based on the user's credit worthiness, and the processing is as follows:

When a suspected spam short message is authorized for sending, credit worthiness is deducted according to the user's credit worthiness and the suspected spam degree. The calculation formula is as follows (assume that credit worthiness is divided into n tiers, denoted C1, C2, ..., Cn, with C1 the highest; T1, T2, ..., Tn denote the number of messages allowed to be sent between credit worthiness tiers; B1, B2, ..., Bn are the spam contribution baseline values of each tier; G is the spam degree):

Credit worthiness deduction value = (C1 - C2) / T1 * (G / B1).
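A short worked example of this formula in Python follows. The function name `credit_deduction` and all numeric values are hypothetical, chosen only to make the arithmetic easy to follow; the text states the formula for the first tier only, and the function simply takes the tier values as arguments.

```python
def credit_deduction(c_upper, c_lower, allowed_messages, spam_degree, baseline):
    """(C1 - C2) / T1 * (G / B1), with the tier values passed in explicitly."""
    return (c_upper - c_lower) / allowed_messages * (spam_degree / baseline)


# Hypothetical numbers: tiers C1 = 1.0 and C2 = 0.75, T1 = 5 messages,
# spam degree G = 40, baseline B1 = 20.
deduction = credit_deduction(1.0, 0.75, 5, 40, 20)  # 0.25 / 5 * 2 = 0.1
```

Read this way, a suspected message whose spam degree equals the baseline costs one T1-th of the gap between the two tiers (0.05 with these numbers), so five such messages would move the user from C1 down to C2, and a message twice as spammy costs twice as much.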

The above is a further detailed description of the invention in connection with specific preferred embodiments, and the specific implementation of the invention shall not be deemed limited to these descriptions. A person of ordinary skill in the technical field of the invention may make a number of simple deductions or substitutions without departing from the inventive concept, and all of these shall be regarded as falling within the protection scope of the invention.

Claims

1. A short message filtering method based on "user credit worthiness and short message spam degree", comprising the steps of:

A) assigning each user an initial credit worthiness according to the user's short message activity; after the initial credit worthiness is assigned, using hadoop to extract daily statistics on the user's behavior in each service and maintaining the user's credit worthiness in real time; the real-time maintenance of user credit worthiness comprises three parts: credit worthiness initialization, deduction of credit worthiness for violations, and accumulation of credit worthiness for active behavior;

B) text preprocessing: first removing normal punctuation marks from the text, then identifying the interference characters configured in the system, recording their count and removing them, and replacing specially encoded digits and pictographic codes;

C) extracting mobile phone numbers and URL addresses, and extracting the behavior features related to the short message; the behavior features related to the short message comprise the user's own behavior features and similar-content features;

D) adding a base spam degree attribute to each keyword, performing keyword matching on the content preprocessed in step B), and recording each matched keyword;

F) combining the user's credit worthiness with the short message spam degree to decide whether to intercept;

the user's own behavior features comprise login from a non-home location, newly registered user, and short message delivery failure rate; the similar-content features comprise sender region distribution, sender login IP distribution, recipient region distribution, and sending frequency.