A short message filtering method based on "user's credit worthiness and short message spam degree"
Technical field
The invention belongs to short message processing technology in the field of Internet communication. In particular, it relates to a short message filtering method based on "user's credit worthiness and short message spam degree", by which the SMS platform of an Internet communication system supervises and filters the content that users submit for public dissemination.
Background art
In recent years, with the rapid development of mailbox services, some offenders who specialize in sending spam messages have used the free text message channels provided by certain mailboxes (such as the 139 mailbox) as a tool for making illicit profits or achieving concealed aims. As one of the value-added services of mobile communication, short messaging provides people with a cheap and convenient means of communication, but it has also bred a large number of spam messages that spread harmful information such as obscene or pornographic content, commercial fraud and commercial advertising. These spam messages seriously disturb people's lives and endanger public safety, and the problem of supervising them has received wide attention from all sectors of society. Besides strengthening the supervision of published information through legislation, it is even more important to explore effective spam message filtering techniques at the technical level.
In the prior art, there are mainly two methods of filtering spam messages: keyword-based filtering and content-based filtering.
In keyword-based filtering, the system sets a number of keywords in advance; any short message whose content contains one of these keywords is regarded as spam and intercepted. This method relies on a single criterion and therefore suffers from a large number of misjudgments.
In content-based filtering, machine learning is used to classify short messages as normal or spam. The machine learning methods currently used for short message classification mainly include Bayes, SVM, KNN and artificial neural networks. This filtering method also suffers from misjudgments.
Summary of the invention
An object of the invention is to provide a short message filtering method based on "user's credit worthiness and short message spam degree" for supervising and filtering the content that users submit for public dissemination.
To achieve the above object, the short message filtering method of the present invention based on "user's credit worthiness and short message spam degree" comprises the steps of:
A) Assigning each user an initial credit worthiness according to how active the short message user is;
B) Text pre-processing: first removing normal punctuation marks from the text, identifying the interference characters configured in the system, recording their count and removing them, and replacing specially encoded digits and pictographic codes;
C) Extracting phone numbers and URL addresses, and extracting behavioural features related to the short message;
D) Adding a base spam degree attribute to each keyword, matching keywords against the content pre-processed in step B), and recording each matched keyword;
E) Similar content determination: calculating the short message spam degree based on similarity;
F) Combining the user's credit worthiness with the short message spam degree to judge whether to intercept.
The invention aims to score short message content and user behaviour together so that the two reinforce each other, and then to combine the result with the user's credit worthiness to decide whether a message is spam, so as to catch as many spam messages as possible while reducing the impact of false interception on users with high credit worthiness.
The present invention assigns each user an initial credit worthiness according to how active the user is, then uses Hadoop every day to extract statistics on each user's behaviour in the various services, and maintains the user's credit worthiness in real time.
Text pre-processing is then carried out. Normal punctuation marks in the text are removed first; the noise characters configured in the system (such as ぁ) are identified, their count is recorded and they are removed; and specially encoded digits and pictographic codes (such as 4., 〇) are replaced.
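The pre-processing step above can be sketched roughly as follows. This is a minimal illustration only: the names `NOISE_CHARS`, `PICTOGRAPH_DIGITS` and `preprocess` are hypothetical, the source names only "ぁ" as a noise character, and the use of Unicode NFKC normalization to fold specially encoded digits and the removal of whitespace are assumptions rather than the method of the invention.

```python
import unicodedata

# Hypothetical configuration: the source only gives "ぁ" as a noise character.
NOISE_CHARS = {"ぁ"}
# Hypothetical mapping for pictographic digits that NFKC does not fold.
PICTOGRAPH_DIGITS = {"〇": "0"}


def preprocess(text):
    """Return (cleaned_text, noise_count) for one short message."""
    # Assumption: NFKC folds many specially encoded digits (e.g. "③") to ASCII.
    folded = unicodedata.normalize("NFKC", text)
    kept = []
    noise_count = 0
    for ch in folded:
        if ch in NOISE_CHARS:
            noise_count += 1  # record the count, then drop the character
        elif ch in PICTOGRAPH_DIGITS:
            kept.append(PICTOGRAPH_DIGITS[ch])
        elif unicodedata.category(ch)[0] in ("P", "Z"):
            continue  # drop punctuation (and, as an assumption, whitespace)
        else:
            kept.append(ch)
    return "".join(kept), noise_count
```

The recorded noise count can later serve as one of the features that contribute to the spam degree.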
Based on the content produced by the second step, phone numbers and URL addresses are extracted, and it is determined whether each phone number is present in the original string. Behavioural features of the sender are extracted, such as login from a different location, newly registered user, high short message delivery failure rate, etc. (extensible). Features of similar content are extracted, such as the geographic distribution of senders, the distribution of sender login IPs, the geographic distribution of recipients, sending frequency, etc. (extensible). The spam degree is calculated from the extracted features, and spam message identification is performed.
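A minimal sketch of the extraction step, under stated assumptions: the phone pattern (11 digits beginning with 1, as used for mainland China mobile numbers) and the simplified URL pattern are illustrative choices, and `extract_contacts` is a hypothetical helper. For simplicity the sketch reads URLs from the original text, whereas the description extracts them from the pre-processed content.

```python
import re

# Assumed pattern: 11-digit mobile number starting with 1, not part of a longer digit run.
PHONE_RE = re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)")
# Simplified URL pattern, for illustration only.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def extract_contacts(original, cleaned):
    """Extract phone numbers and URLs, flagging numbers absent from the original."""
    phones = PHONE_RE.findall(cleaned)
    urls = URL_RE.findall(original)
    # A number that only appears after pre-processing was disguised by the sender.
    hidden = [p for p in phones if p not in original]
    return {"phones": phones, "urls": urls, "hidden_phones": hidden}
```

A phone number that is visible only in the cleaned text suggests deliberate obfuscation, which is why the check against the original string is kept as a separate feature.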
A base spam degree attribute is added to each keyword. Keyword matching is performed on the pre-processed content, and each matched keyword is recorded. The spam degree is calculated from the matched keywords and added to the result calculated in the third step, and spam message identification is performed.
Similar content determination follows. The spam degree is calculated based on similarity and added to the result of the fourth step, and spam message identification is performed.
The user's credit worthiness and the short message spam degree are then combined to judge whether to intercept. When a short message has a moderate spam degree and the user is allowed to send it, the user's credit worthiness is deducted at the same time.
Based on user's credit worthiness and short message spam degree, the present invention can filter spam messages more accurately and reduce the misjudgment of spam messages.
Brief description of the drawings
Fig. 1 is a flow chart of an embodiment of the present invention for filtering spam messages;
Fig. 2 is a flow chart of an embodiment of the present invention for maintaining user's credit worthiness;
Fig. 3 is a flow chart of an embodiment of the text pre-processing step shown in Fig. 1;
Fig. 4 is a flow chart of an embodiment of the behavioural feature processing step shown in Fig. 1;
Fig. 5 is a flow chart of an embodiment of the keyword matching step shown in Fig. 1;
Fig. 6 is a flow chart of an embodiment of the similarity determination step shown in Fig. 1;
Fig. 7 is a flow chart of an embodiment of the suspected spam message processing step shown in Fig. 1.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Figs. 1 to 7 are flow charts of an embodiment of the present invention for filtering spam messages. In this example, the spam filtering method of the invention is embodied in the behavioural feature processing step, the keyword processing step and the similarity determination step, as well as in the normal short message processing flow, the suspected spam message processing flow and the spam message processing flow. These three processing flows mainly provide the data on which the maintenance of user's credit worthiness is based.
In this example, the spam filtering method of the invention determines whether a message is spam by scoring the text content and the features of the short message. Three methods, namely behavioural feature processing, keyword matching and similarity determination, are applied in sequence and combined to improve the accuracy of spam message judgment.
Meanwhile in this example, rubbish filtering method of the present invention is also in relation with black/white list filter method, i.e. blacklist
User's credit worthiness forbids sending any short message for 0, and white list user credit worthiness is that the short message that 1 acquiescence is sent is normal.
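The black/white list rule can be sketched as a shortcut that runs before any scoring. The function name `list_shortcut` and the assumption that credit worthiness lies between 0 and 1 for all other users are illustrative; the description states only the two end values.

```python
def list_shortcut(credit):
    """Return "block", "pass", or None when the full scoring flow should run."""
    if credit <= 0:
        return "block"  # blacklist: credit worthiness 0, nothing may be sent
    if credit >= 1:
        return "pass"   # whitelist: credit worthiness 1, treated as normal
    return None         # assumption: every other user goes through scoring
```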
The five processing flows are described in detail below.
User's credit worthiness maintenance flow-》This flow comprises three parts: credit worthiness initialization, deduction of credit worthiness for violations, and accumulation of credit worthiness for active behaviour. The violations that lead to deduction include submitting spam messages and sending suspected spam messages, and the deduction is made in real time. Accumulation of credit worthiness for active behaviour is carried out by periodic Hadoop analysis. Credit worthiness initialization rule:
Behavioural feature processing flow-》This flow mainly extracts the behavioural features related to the short message. For example, commercial short messages generally contain phone numbers or URL addresses, and their key content is interspersed with interference characters or written with specially encoded characters (such as 4./(13)). Spam messages are also typically sent in bulk, so it is also necessary to analyse, for similar content, the IP distribution, the geographic distribution of recipients, the geographic distribution of senders and so on. The above information is combined to calculate the spam degree of the short message, which is then used to determine whether the message is spam. During intermediate processing, the flow only identifies whether the message is spam, that is, it only judges whether the spam degree exceeds the predetermined spam message threshold. If the message is judged to be spam, credit worthiness is deducted.
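As a rough sketch of how the extracted features might be turned into a spam degree and compared with the threshold, the following uses a simple weighted sum. The feature names, the weights in `FEATURE_WEIGHTS` and the value of `SPAM_THRESHOLD` are all hypothetical; the description does not specify how individual features are weighted.

```python
# Hypothetical per-feature spam contributions; the source gives no values.
FEATURE_WEIGHTS = {
    "has_phone": 10,
    "has_url": 10,
    "noise_chars": 15,
    "remote_login": 5,
    "new_user": 5,
    "high_failure_rate": 10,
}
SPAM_THRESHOLD = 50  # hypothetical predetermined spam message threshold


def behaviour_spam_degree(features):
    """Sum the contributions of every feature flagged as present."""
    return sum(w for name, w in FEATURE_WEIGHTS.items() if features.get(name))


def exceeds_spam_threshold(spam_degree):
    """Intermediate check: only compare against the spam message threshold."""
    return spam_degree > SPAM_THRESHOLD
```

Because later flows add to the same running total, a message that passes this intermediate check can still be judged spam after keyword matching or similarity determination.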
Keyword matching processing flow-》First, a spam value attribute is defined for general keywords, combination keywords and sensitive keywords. The flow then matches keywords against the pre-processed text, calculates the spam degree of the matched keywords, and adds it to the total spam degree accumulated so far. Finally, whether the message is spam is determined from the spam degree. During intermediate processing, the flow only identifies whether the message is spam, that is, it only judges whether the spam degree exceeds the predetermined spam message threshold. During keyword matching, regular expression matching may of course also be performed on the original content string.
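The keyword flow can be illustrated with the sketch below. The keyword tables and their spam values are invented for the example, and treating a combination keyword as a match only when all of its parts occur in the text is an assumption about what "combination keyword" means.

```python
# Hypothetical keyword tables; the values are illustrative spam contributions.
GENERAL = {"invoice": 10, "discount": 5}
SENSITIVE = {"lottery": 40}
# Assumption: a combination keyword counts only when all of its parts occur.
COMBINATION = {("loan", "collateral"): 30}


def keyword_spam_degree(cleaned, prior_total=0):
    """Return (new_total, matched_keywords) for the pre-processed text."""
    matched = []
    score = 0
    for table in (GENERAL, SENSITIVE):
        for word, value in table.items():
            if word in cleaned:
                matched.append(word)  # record every matched keyword
                score += value
    for parts, value in COMBINATION.items():
        if all(part in cleaned for part in parts):
            matched.append("+".join(parts))
            score += value
    # Add to the total spam degree accumulated by the earlier flows.
    return prior_total + score, matched
```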
Similarity determination processing flow-》This flow performs fingerprint similarity matching against the historical spam messages that have already been intercepted and calculates the maximum similarity. Matches whose similarity exceeds a certain value are accumulated, and the extracted attributes are converted into a spam degree (a Bayesian algorithm may also be used to classify the text), which is added to the total spam degree accumulated so far. If the total spam degree is below the suspected spam threshold, the message is handled directly as a normal short message; if it is above the spam message threshold, the message is judged to be spam; otherwise, the suspected spam message processing flow is executed.
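The description does not say how the fingerprint similarity is computed. As a stand-in, the sketch below uses Jaccard similarity over character bigrams, followed by the three-way routing on the total spam degree; all function names and the choice of bigrams are assumptions.

```python
def shingles(text, n=2):
    """Set of character n-grams, used here as a simple text fingerprint."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two texts' fingerprints, between 0.0 and 1.0."""
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def max_similarity(message, intercepted_history):
    """Highest similarity between the message and any intercepted spam message."""
    return max((jaccard(message, old) for old in intercepted_history), default=0.0)


def route(total_spam_degree, suspected_threshold, spam_threshold):
    """Choose the processing flow from the total spam degree."""
    if total_spam_degree < suspected_threshold:
        return "normal"
    if total_spam_degree > spam_threshold:
        return "spam"
    return "suspected"
```

For a large message history, a locality-sensitive fingerprint would scale better than pairwise comparison; the pairwise form is used here only to keep the example short.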
Suspected spam message processing flow-》A judgment is made based on the user's credit worthiness, and processing is as follows:
When a suspected spam message is authorized to be sent, credit worthiness is deducted according to the user's credit worthiness and the suspected spam degree. The formula is as follows (assume that credit worthiness is divided into n tiers, denoted C1, C2, ..., Cn, with C1 the highest; T1, T2, ..., Tn denote the number of messages allowed to be sent between credit worthiness tiers; B1, B2, ..., Bn denote the spam contribution baseline value of each tier; G is the spam degree):
Credit worthiness deduction value = (C1 - C2) / T1 * (G / B1).
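The formula can be evaluated as in the sketch below. The description gives only the expression for the first tier; extending it to tier i as (Ci - Ci+1) / Ti * (G / Bi) is an assumption, as are the function name and the sample values in the usage note.

```python
def credit_deduction(tier, C, T, B, G):
    """Deduction for a user in the given 1-based tier.

    C, T and B are lists holding C1..Cn, T1..Tn and B1..Bn; G is the spam
    degree. Generalizing the tier-1 formula to other tiers is an assumption.
    """
    if not 1 <= tier < len(C):
        raise ValueError("tier must have a lower tier below it")
    i = tier - 1
    return (C[i] - C[i + 1]) / T[i] * (G / B[i])
```

For example, with C = [1.0, 0.8, 0.6], T = [10, 5, 2], B = [50, 40, 30] and G = 25, a tier-1 user loses about 0.01 per suspected spam message. When G equals B1, T1 such messages move the user from C1 to C2.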
Based on user's credit worthiness and short message spam degree, the present invention can filter spam messages more accurately and reduce the misjudgment of spam messages.
The above is a further detailed description of the present invention in connection with specific preferred embodiments, and the specific implementation of the invention shall not be regarded as limited to these descriptions. A person of ordinary skill in the technical field of the invention may make several simple deductions or substitutions without departing from the concept of the invention, and all of these shall be regarded as falling within the protection scope of the invention.