CN103761221B - System and method for identifying sensitive text messages

Info

Publication number: CN103761221B
Application number: CN201310749656.9A
Authority: CN
Inventors: 何泉昊; 权圣; 陆强
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2013-12-31
Filing date: 2013-12-31
Publication date: 2017-05-17
Anticipated expiration: 2033-12-31
Also published as: CN103761221A

Abstract

The invention discloses a system and a method for identifying sensitive text messages. The system comprises a data training module, a data testing module, and an information source partition identification module. The data training module represents training texts as feature space models in vector-space form; the data testing module represents test texts as feature space models in vector-space form; and the information source partition identification module divides a test text set into a fuzzy region and a non-fuzzy region and classifies the texts in each region separately.

Description

System and method for identifying sensitive text messages

Technical field

The present invention relates to a system and a method for identifying sensitive text messages.

Background art

Text messages such as microblog posts have become a booster and amplifier of diverse public opinion, and their role in the public opinion environment has changed greatly. If the online public opinion environment is illegally disrupted in violation of the principles of freedom of information and fair dissemination, the result can be improper commercial competition and confusion among the public. The complex and changeable online public opinion environment, and the various interests that drive it, pose unprecedented challenges to the effective monitoring of text messages on microblogs, community forums (BBS), and the like.

In the field of content security for text messages, the techniques currently in use are mainly rule-based methods and methods based on probability statistics.

(1) A rule-based method designs a set of rules in advance to indicate whether a given piece of information is sensitive. As shown in Fig. 1, the method is implemented mainly by a data input preprocessing module, a data rule extraction module, a rule judgment module, and a result output module. The rule judgment module, which is the core module, decides whether the data satisfies the relevant rules. In concrete implementations, several typical rule-based methods are as follows:

Rules based on IP addresses, domain names, and routing: black and white lists are set for IP addresses, so that an information source whose IP is on the blacklist is filtered and an information source whose IP is on the whitelist is let through; server-side configuration can be applied to access control lists, TCP wrappers, host routing tables, and the like; security authentication methods include challenge/response systems and computational test systems.

Rules based on content and behavior: for example, a mail contains too many advertisements, uses English capital letters excessively, uses words related to real estate or medicine excessively, or uses overly gaudy HTML styles and colors. Once such a content rule is matched, the information source is filtered. Likewise, if the header of a mail from the information source indicates bulk mail, and the IP of the information source monitored at the MTA exceeds a traffic threshold within a specified time, special handling such as filtering is applied.
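The rule types described above can be illustrated by the following minimal sketch. It is not taken from any particular prior-art system: the class name `RuleFilter`, the keyword list, and both thresholds are hypothetical values chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RuleFilter:
    """Toy rule-based filter; every list and threshold here is an assumption."""
    blacklist: set = field(default_factory=set)
    whitelist: set = field(default_factory=set)
    keywords: tuple = ("advertisement",)
    max_upper_ratio: float = 0.5   # assumed cap on the share of capital letters
    max_keyword_hits: int = 2      # assumed cap on keyword occurrences

    def is_blocked(self, ip: str, text: str) -> bool:
        if ip in self.whitelist:   # whitelisted sources are let through
            return False
        if ip in self.blacklist:   # blacklisted sources are filtered
            return True
        # Content rules: excessive capital letters or excessive keyword use.
        letters = [ch for ch in text if ch.isalpha()]
        upper_ratio = (
            sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
        )
        hits = sum(text.lower().count(k) for k in self.keywords)
        return upper_ratio > self.max_upper_ratio or hits > self.max_keyword_hits
```

Because every decision is a fixed comparison, a sender who learns the lists and thresholds can stay just below them, which is the weakness of rule-based methods discussed in this section.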

(2) A method based on probability statistics uses a number of features to classify different content: it computes a measure of how strongly the features indicate each class and takes the maximum, and if the class with the maximum value is the sensitive class, the information is handled accordingly. As shown in Fig. 2, the method is implemented mainly by a test data input module, a training data training module, a classifier classification module, and a result output module. Data training is a statistical learning process that yields the corresponding classifier. The classification algorithm used to train the classifier can be chosen according to the actual application scenario, for example naive Bayes or K-means.

Shortcomings of rule-based methods: they perform poorly in applications where the rules are not obvious, and normal information sources are often classified as abnormal ones. Even in applications where the rules are obvious, once the creator of an information source knows all the rules, its behavior becomes more covert in order to get around them. Another problem is that whether an information source is identified as sensitive depends on its readers and on where it is posted: content that is clearly marked as sensitive for certain users, bulletin boards, or Wikipedia may be quite normal elsewhere. Because different users define sensitive information by different standards, separate examples and data sets must also be built for different users, groups, and so on.

Methods based on probability statistics have shortcomings of their own that depend on the algorithm used. The biggest defect of the naive Bayes classifier is that it cannot handle outcomes that arise from combinations of features: if the words "U.S." and "911" are each assumed to be non-sensitive, then information that is actually sensitive, such as "U.S. 911", is identified as non-sensitive and let through. The main drawback of K-means is that, to find the closest data items, every item to be predicted must be compared with all data items, which is very inefficient in both time and space for data sets of millions or tens of millions of items.

Accordingly, it is desirable to provide a high-performance system and method for identifying sensitive text messages.

Summary of the invention

The present invention is proposed to solve at least one of the above shortcomings and problems of the prior art. In view of the shortcomings of the prior art, we propose a method that partitions the set of information sources and identifies the information sources in successive steps using different types of features. On the one hand, the method shows higher performance when processing large data sets; on the other hand, when applied to sensitive information identification, it is considerably more effective than common classification algorithms.

According to one aspect, the present invention provides a system for identifying sensitive text messages, including: a data training module for representing training text as a feature space model in vector-space form; a data test module for representing test text as a feature space model in vector-space form; and an information source partition identification module for dividing a test text set into a fuzzy region and a non-fuzzy region according to the distribution of text points in a two-dimensional space, and for classifying the fuzzy region and the non-fuzzy region separately.

Optionally, the data training module includes: a training text preprocessing module for preprocessing the training text; a feature extraction module for extracting features from the preprocessing results of the training text preprocessing module; and a feature selection module for performing feature selection on the features extracted by the feature extraction module, so that features composed of characters, words, and character-word strings are selected to obtain a feature space.

Optionally, the data test module includes: a test text preprocessing module for preprocessing the test text; a feature extraction module for extracting features from the preprocessing results of the test text preprocessing module; and a feature selection module for performing feature selection on the features extracted by the feature extraction module, so that features composed of characters, words, and character-word strings are selected to obtain a feature space.

Optionally, the information source partition identification module includes: a region division module for dividing the test text set into the fuzzy region and the non-fuzzy region according to the distribution of text points in the two-dimensional space; a first classification and identification module for classifying the fuzzy region using characters or words as features; and a second classification and identification module for classifying the non-fuzzy region using, as features, two-element strings composed of two adjacent characters or words.

Optionally, the characters or words are obtained by a word segmentation tool.

According to another aspect of the present invention, there is provided a method for identifying sensitive text messages, including: representing training text as a feature space model in vector-space form; representing test text as a feature space model in vector-space form; dividing a test text set into a fuzzy region and a non-fuzzy region according to the distribution of text points in a two-dimensional space; classifying the fuzzy region using characters or words as features; and classifying the non-fuzzy region using, as features, two-element strings composed of two adjacent characters or words.

Optionally, representing the training text as a feature space model in vector-space form includes: preprocessing the training text; extracting features from the preprocessing results; and performing feature selection on the extracted features.

Optionally, representing the test text as a feature space model in vector-space form includes: preprocessing the test text; extracting features from the preprocessing results; and performing feature selection on the extracted features.

Optionally, the characters or words are obtained by a word segmentation tool.

Optionally, a classifier is trained using Bayes or K-means as the classification algorithm, so as to divide the test text set into the fuzzy region and the non-fuzzy region.

Brief description of the drawings

The above and other aspects, features, and advantages of certain exemplary embodiments of the invention will become apparent to those skilled in the art from the following description taken in conjunction with the accompanying drawings, in which:

Fig. 1 is a block diagram of an implementation of the rule-based method;

Fig. 2 is a block diagram of an implementation of the method based on probability statistics;

Fig. 3 is a block diagram of the system for identifying sensitive text messages; and

Fig. 4 is a flow chart of the method for identifying sensitive text messages.

Detailed description

The following description, made with reference to the accompanying drawings, is provided to assist in a comprehensive understanding of exemplary embodiments of the invention. It includes various details to aid understanding, but these are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the invention. Likewise, descriptions of well-known functions and structures are omitted for clarity and conciseness.

The system and method for identifying sensitive text messages according to the invention divide the test text set into two parts according to a fuzzy region, and identify the two parts separately using different types of features. This technical solution improves on traditional probability-based techniques so that the precision and recall of the identification and classification results are clearly improved. Different algorithms can also be combined, so the solution has considerable application potential in sensitive information identification, text classification, and the efficient classification of large volumes of data.

Fig. 3 is a block diagram of the system for identifying sensitive text messages.

As shown in Fig. 3, the system for identifying sensitive text messages according to the invention includes a data training module 310, a data test module 320, and an information source partition identification module 330.

The data training module 310 represents training text as a feature space model in vector-space form.

In one embodiment, the data training module 310 includes a training text preprocessing module 312, a feature extraction module 314, and a feature selection module 316.

The training text preprocessing module 312 preprocesses the training text. For example, it can remove punctuation marks and invalid characters from the training text, filter out stop words, and then perform word segmentation on the training text. Stop words may include, for example, "be" and the like.
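As a non-limiting sketch of the preprocessing described above, the following function strips punctuation and non-printable characters, splits the text into tokens, and drops stop words. The whitespace split is only a stand-in for a real word segmentation tool, and the stop-word list is an illustrative assumption; stop words are filtered after tokenization here because the filter needs tokens to compare against.

```python
import string

# Illustrative stop-word list (an assumption, not taken from the specification).
STOP_WORDS = {"the", "is", "of", "a"}


def preprocess(text, stop_words=STOP_WORDS):
    """Remove punctuation and invalid characters, tokenize, filter stop words."""
    cleaned = "".join(
        " " if ch in string.punctuation else ch   # punctuation becomes a separator
        for ch in text
        if ch.isprintable() or ch.isspace()       # drop control characters
    )
    # Whitespace splitting stands in for the word segmentation tool.
    tokens = cleaned.lower().split()
    return [t for t in tokens if t not in stop_words]
```

For Chinese text, the `split()` call would be replaced by a call to whichever segmentation tool is in use; the rest of the function is unaffected.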

The feature extraction module 314 extracts features from the preprocessing results of the training text preprocessing module 312. For example, it can select characters, words, or two-element strings composed of characters or words as features, as required. The features selected in the first step should have strong coverage and strong discriminating power, so ordinary words obtained by word segmentation are chosen as features; their recognition performance is better than that of single characters or character combinations. The features selected in the second step should discriminate well among those texts whose first-step recognition result is unreliable, and their semantic character should be more pronounced. Word combinations have exactly this property: for example, the word combination "commodity attribute" carries stronger semantics than the single words "commodity" and "attribute". The second step therefore uses, as features, two-element word strings formed from two adjacent words after word segmentation. As an example, with a word segmentation tool, the first step can directly take the tool's segmentation results as features, and the second step can take combinations of two adjacent words after segmentation as features.
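The two feature types can be sketched as follows, assuming the text has already been segmented into a list of tokens. The function names are hypothetical and chosen only for this illustration.

```python
def unigram_features(tokens):
    """First-step features: the segmented words themselves."""
    return set(tokens)


def bigram_features(tokens):
    """Second-step features: strings formed from two adjacent words."""
    return {f"{a} {b}" for a, b in zip(tokens, tokens[1:])}
```

For the tokens `["commodity", "attribute", "list"]`, the first function yields the three single words, while the second yields `"commodity attribute"` and `"attribute list"`, so the combination is available to the classifier as a feature in its own right.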

The feature selection module 316 performs feature selection on the features extracted by the feature extraction module 314, so that features composed of characters, words, and character-word strings are selected to obtain a feature space represented by a vector space model. Feature selection methods that can be used include term frequency (TF) statistics, document frequency (DF) statistics, inverse document frequency (IDF), mutual information, the chi-square (CHI) statistic, information gain, and so on. For example, the feature selection module 316 can perform feature selection using the chi-square statistic.
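A minimal sketch of chi-square feature selection follows. It uses the standard 2x2 contingency form of the statistic, where a, b, c, and d count the documents that contain or lack a feature inside and outside a target class. The function names and the top-k cut-off are assumptions made for illustration; the specification does not fix how many features are kept.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic for one feature and one class.

    a: in-class docs containing the feature    b: out-of-class docs containing it
    c: in-class docs lacking the feature       d: out-of-class docs lacking it
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def select_features(docs, labels, target, k):
    """Return the k features with the highest chi-square score for `target`.

    docs is a list of feature sets, labels the matching class labels.
    Ties are broken alphabetically so the result is deterministic.
    """
    vocab = set().union(*docs) if docs else set()
    scores = {}
    for t in vocab:
        a = sum(1 for f, y in zip(docs, labels) if t in f and y == target)
        b = sum(1 for f, y in zip(docs, labels) if t in f and y != target)
        c = sum(1 for f, y in zip(docs, labels) if t not in f and y == target)
        d = sum(1 for f, y in zip(docs, labels) if t not in f and y != target)
        scores[t] = chi_square(a, b, c, d)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

A feature that appears equally often inside and outside the target class scores zero and is dropped first, which is the behavior wanted from a selection step.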

Similar to the data training module 310, the data test module 320 represents test text as a feature space model in vector-space form.

In one embodiment, the data test module 320 includes a test text preprocessing module 322, a feature extraction module 324, and a feature selection module 326.

The test text preprocessing module 322 preprocesses the test text. For example, it can remove punctuation marks and invalid characters from the test text, filter out stop words, and then perform word segmentation on the test text. Stop words may include, for example, "be" and the like.

The feature extraction module 324 extracts features from the preprocessing results of the test text preprocessing module 322. For example, it can select characters, words, or two-element strings composed of characters or words as features, as required. As described above, the features selected in the first step should have strong coverage and strong discriminating power, so ordinary words obtained by word segmentation are chosen as features; their recognition performance is better than that of single characters or character combinations. The features selected in the second step should discriminate well among those texts whose first-step recognition result is unreliable, and their semantic character should be more pronounced. Word combinations have exactly this property: for example, the word combination "commodity attribute" carries stronger semantics than the single words "commodity" and "attribute". The second step therefore uses, as features, two-element word strings formed from two adjacent words after word segmentation.

The feature selection module 326 performs feature selection on the features extracted by the feature extraction module 324, so that features composed of characters, words, and character-word strings are selected to obtain a feature space represented by a vector space model. Feature selection methods that can be used include term frequency (TF) statistics, document frequency (DF) statistics, inverse document frequency (IDF), mutual information, the chi-square (CHI) statistic, information gain, and so on. For example, the feature selection module 326 can perform feature selection using the chi-square statistic.

The information source partition identification module 330 divides the test text set into a fuzzy region A and a non-fuzzy region B according to the distribution of text points in a two-dimensional space. It first classifies the fuzzy region A using characters or words as features, and then classifies the non-fuzzy region B using, as features, two-element strings composed of two adjacent characters or words.

In one embodiment, the information source partition identification module 330 includes a region division module 332, a first classification and identification module 334, and a second classification and identification module 336. The region division module 332 can divide the test text set into the fuzzy region A and the non-fuzzy region B according to the distribution of text points in the two-dimensional space. The first classification and identification module 334 can classify the fuzzy region A using characters or words as features. The second classification and identification module 336 can classify the non-fuzzy region B using, as features, two-element strings composed of two adjacent characters or words.

For example, the information source partition identification module 330 can train a classifier using either Bayes or K-means as the classification algorithm, so as to split the test text set into two parts: a text set A outside the fuzzy region and a text set B inside the fuzzy region.

Specifically, taking the Bayesian classification algorithm as an example, a text is given as a binary vector d = (w_1, w_2, ..., w_D), where w_i = 0 or 1: w_i = 1 if the i-th feature occurs in text d, and w_i = 0 otherwise. The probability that text d_x belongs to class c_j is written P(c_j | d_x); after calculation, text d_x is assigned to the class with the largest value. The formula for P(c_j | d_x) can be expressed as:

[equation not reproduced in the source text]

where n is the number of features. The discriminant function is then expressed as:

[equation not reproduced in the source text]

Let

[equation not reproduced in the source text]

The discriminant of the classifier can then be written approximately as f(d) = MV_max - MV_sec. A two-dimensional space is constructed with MV_max as the abscissa and MV_sec as the ordinate, and text d is represented as a point (x, y) in that space. The distance dist from this point to the dividing line f(d) = 0 is:

[equation not reproduced in the source text]

The magnitude of dist reflects how reliably the first step identifies text d_x as the sensitive class c^*, and thus determines which part of the first-step recognition is unreliable. Here Dist is an empirical boundary constant.
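The region division can be sketched as follows, under stated assumptions, because the formulas themselves are not available in this text. The sketch substitutes a standard Bernoulli naive Bayes model with Laplace smoothing, takes MV_max and MV_sec to be the largest and second-largest class scores, uses the ordinary point-to-line distance |x - y| / sqrt(2) from the point (MV_max, MV_sec) to the line x - y = 0, and treats a text as fuzzy when that distance is below the boundary constant. All function names are hypothetical.

```python
import math


def train_bernoulli_nb(docs, labels, vocab):
    """Return {class: (log prior, {feature: P(w=1 | class)})}."""
    model = {}
    for c in set(labels):
        rows = [d for d, y in zip(docs, labels) if y == c]
        prior = math.log(len(rows) / len(docs))
        # Laplace smoothing keeps every probability strictly inside (0, 1).
        p = {w: (sum(w in d for d in rows) + 1) / (len(rows) + 2) for w in vocab}
        model[c] = (prior, p)
    return model


def log_scores(model, doc, vocab):
    """Unnormalised log-posterior of each class for one binary text vector."""
    out = {}
    for c, (prior, p) in model.items():
        out[c] = prior + sum(
            math.log(p[w]) if w in doc else math.log(1.0 - p[w]) for w in vocab
        )
    return out


def margin_distance(scores):
    """Distance from the point (MV_max, MV_sec) to the line MV_max - MV_sec = 0."""
    ranked = sorted(scores.values(), reverse=True)
    return (ranked[0] - ranked[1]) / math.sqrt(2)


def partition(model, docs, vocab, dist_threshold):
    """Split document indices into (fuzzy, non_fuzzy) by margin distance."""
    fuzzy, non_fuzzy = [], []
    for i, doc in enumerate(docs):
        d = margin_distance(log_scores(model, doc, vocab))
        (fuzzy if d < dist_threshold else non_fuzzy).append(i)
    return fuzzy, non_fuzzy
```

A text whose two best class scores are nearly equal lies close to the dividing line and lands in the fuzzy set; the value of `dist_threshold` plays the role of the empirical constant and would have to be tuned on held-out data.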

For example, in one embodiment, a total of 5000 records are used, comprising 4000 non-sensitive records and 1000 sensitive records. In the first step, the non-sensitive texts and the sensitive texts are each randomly divided into 4 equal parts, of which 3 parts form the training data set and 1 part forms the test data set. Four-fold cross-validation is performed, and the mean over the 4 folds is used as the final metric. First, characters and words are extracted from the training data and feature selection is performed to form the feature space; features of the same type are then extracted from the test texts, the test texts are classified by a given algorithm, and the results are evaluated. If a result is reliable, it is taken as the final identification decision; if it is unreliable, the text to be identified is placed in the fuzzy interval and left for the second step. In the second step, a feature type entirely different from that of the first step is used: statistical learning is first performed on the training data to form the feature space, features of that type are then extracted from the test texts for identification, and the final decision is made from the recognition results of the two steps. Cross-validation shows that, with the Bayesian algorithm, the invention achieves clearly better precision and recall in identifying sensitive information than a single-step Bayesian method.
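The data split in this embodiment can be sketched as a stratified four-fold cross-validation. The helper names, the fixed random seed, and the `evaluate` callback are assumptions introduced for illustration; the callback stands for whatever training and scoring procedure is being measured.

```python
import random


def stratified_folds(labels, k=4, seed=0):
    """Return k index lists; each class is shuffled and dealt evenly over folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    for idx in by_class.values():
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def cross_validate(labels, evaluate, k=4, seed=0):
    """Average evaluate(train_idx, test_idx) over the k train/test rotations."""
    folds = stratified_folds(labels, k, seed)
    results = []
    for t in range(k):
        test_idx = folds[t]
        train_idx = [i for f, fold in enumerate(folds) if f != t for i in fold]
        results.append(evaluate(train_idx, test_idx))
    return sum(results) / k
```

With 4000 non-sensitive and 1000 sensitive records, each fold receives 1000 of the former and 250 of the latter, so every rotation trains on 3750 records and tests on 1250.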

Fig. 4 is a flow chart of the method for identifying sensitive text messages.

As shown in Fig. 4, the method starts at step 410. In step 410, training text is represented as a feature space model in vector-space form.

First, the training text can be preprocessed. For example, punctuation marks and invalid characters can be removed from the training text, stop words can be filtered out, and word segmentation can then be performed on the training text. Stop words may include, for example, "be" and the like.

Then, features can be extracted from the preprocessing results. For example, characters, words, or two-element strings composed of characters or words can be selected as features, as required.

Finally, feature selection can be performed on the extracted features, so that features composed of characters, words, and character-word strings are selected to obtain a feature space represented by a vector space model. Feature selection methods that can be used include term frequency (TF) statistics, document frequency (DF) statistics, inverse document frequency (IDF), mutual information, the chi-square (CHI) statistic, information gain, and so on. For example, feature selection can be performed using the chi-square statistic.
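Once a feature space has been selected, representing a text in vector-space form reduces to marking which of the selected features it contains. The following one-function sketch assumes a binary (0/1) representation, matching the two-value text vector used in the Bayesian example above; the function name is hypothetical.

```python
def to_binary_vector(features, feature_space):
    """Map a text's features onto a fixed, ordered feature space as 0/1 values."""
    present = set(features)
    return [1 if f in present else 0 for f in feature_space]
```

The order of `feature_space` must be the same for training and test texts so that position i always refers to the same feature.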

In step 420, test text is represented as a feature space model in vector-space form.

Then, features can be extracted from the preprocessing results. For example, characters, words, or two-element strings composed of characters or words can be selected as features, as required.

Finally, feature selection is performed on the extracted features, so that features composed of characters, words, and character-word strings are selected to obtain a feature space represented by a vector space model. Feature selection methods that can be used include term frequency (TF) statistics, document frequency (DF) statistics, inverse document frequency (IDF), mutual information, the chi-square (CHI) statistic, information gain, and so on. For example, feature selection can be performed using the chi-square statistic.

In step 430, the text set to be identified is divided into a fuzzy region A and a non-fuzzy region B according to the distribution of text points in a two-dimensional space.

In step 440, the fuzzy region A is classified using characters or words as features.

In step 450, the non-fuzzy region B is classified using, as features, two-element strings composed of two adjacent characters or words.

For example, a classifier can be trained using either Bayes or K-means as the classification algorithm, so as to split the test text set into two parts: a text set A outside the fuzzy region and a text set B inside the fuzzy region.

For the specific method of splitting the test text set into the text set A outside the fuzzy region and the text set B inside the fuzzy region, refer to the details described above with respect to Fig. 3, which are not repeated here.
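Steps 430 to 450 can be sketched as a dispatch over the two regions, following the wording of steps 440 and 450. The function name and its three callable parameters are hypothetical: `is_fuzzy` stands for the region division of step 430, and the two classifiers stand for whatever trained models are used for each region.

```python
def classify_by_region(docs, is_fuzzy, fuzzy_classifier, non_fuzzy_classifier):
    """Classify each tokenized text with the feature type assigned to its region.

    docs: list of token lists; is_fuzzy: index -> bool (step 430);
    each classifier maps a feature set to a class label.
    """
    results = {}
    for i, tokens in enumerate(docs):
        if is_fuzzy(i):
            # Step 440: single characters or words as features.
            results[i] = fuzzy_classifier(set(tokens))
        else:
            # Step 450: pairs of adjacent characters or words as features.
            pairs = set(zip(tokens, tokens[1:]))
            results[i] = non_fuzzy_classifier(pairs)
    return results
```

Keeping the region test and the classifiers as parameters reflects the statement that different algorithms can be combined; any of them can be swapped without changing the control flow.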

It should be noted that the apparatus and method embodiments of the invention have been described separately above, but details described for one embodiment also apply to the other.

The basic principles of the invention have been described above in connection with specific embodiments. It should be noted, however, that those of ordinary skill in the art will understand that all or any of the steps or components of the method and system of the invention can be implemented in software, hardware, firmware, or a combination thereof, and that this can be achieved by those of ordinary skill in the art using their basic programming skills after reading the description of the invention.

Therefore, the object of the invention can also be achieved by running a software module or a set of software modules on any computing device. The computing device may be a well-known general-purpose device. The object of the invention can thus also be achieved merely by providing a program product containing program code that implements the method or device. That is, such a program product also constitutes the invention, and a storage medium storing such a program product also constitutes the invention. Obviously, the storage medium can be any known storage medium or any storage medium developed in the future.

Although this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations, and even initially claimed as such, one or more features from a claimed combination can in some cases be removed from the combination, and the claimed combination may be directed to a sub-combination or a variation of a sub-combination.

Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve the desired results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of the various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together into a single software product or packaged into multiple software products.

A computer program (also referred to as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages and declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (for example, one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (for example, files that store one or more modules, subprograms, or portions of code).

The above specific embodiments do not limit the scope of protection of the invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can occur depending on design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A system for identifying sensitive text messages, comprising:

a data training module for representing training text as a feature space model in vector-space form;

a data test module for representing test text as a feature space model in vector-space form; and

an information source partition identification module for dividing a test text set into a fuzzy region and a non-fuzzy region according to the distribution of text points in a two-dimensional space, and for classifying the fuzzy region and the non-fuzzy region separately;

wherein the data training module comprises: a training text preprocessing module for preprocessing the training text; a first feature extraction module for extracting features from the preprocessing results of the training text preprocessing module; and a first feature selection module for performing feature selection on the features extracted by the feature extraction module, so that features composed of characters, words, and character-word strings are selected to obtain a feature space;

wherein the data test module comprises: a test text preprocessing module for preprocessing the test text; a second feature extraction module for extracting features from the preprocessing results of the test text preprocessing module; and a second feature selection module for performing feature selection on the features extracted by the feature extraction module, so that features composed of characters, words, and character-word strings are selected to obtain a feature space;

wherein the information source partition identification module comprises: a region division module for dividing the test text set into the fuzzy region and the non-fuzzy region according to the distribution of text points in the two-dimensional space; a first classification and identification module for classifying the fuzzy region using characters or words as features; and a second classification and identification module for classifying the non-fuzzy region using, as features, two-element strings composed of two adjacent characters or words.

2. The system according to claim 1, wherein the characters or words are obtained by a word segmentation tool.

3. A method for identifying sensitive text messages, comprising:

representing training text as a feature space model in vector-space form;

representing test text as a feature space model in vector-space form;

dividing a test text set into a fuzzy region and a non-fuzzy region according to the distribution of text points in a two-dimensional space;

classifying the fuzzy region using characters or words as features; and

classifying the non-fuzzy region using, as features, two-element strings composed of two adjacent characters or words;

wherein representing the training text as a feature space model in vector-space form comprises: preprocessing the training text; extracting features from the preprocessing results; and performing feature selection on the extracted features;

wherein representing the test text as a feature space model in vector-space form comprises: preprocessing the test text; extracting features from the preprocessing results; and performing feature selection on the extracted features.

4. The method according to claim 3, wherein the characters or words are obtained by a word segmentation tool.

5. The method according to claim 3, wherein a classifier is trained using Bayes or K-means as the classification algorithm, so as to divide the test text set into the fuzzy region and the non-fuzzy region.