CN106951437A - Recognition processing method and device suitable for multiple Chinese sensitive words and sentences

Info

Publication number: CN106951437A
Application number: CN201710072161.5A
Authority: CN
Inventors: 喻民; 刘超; 卢越; 李敏; 姜建国
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2017-02-08
Filing date: 2017-02-08
Publication date: 2017-07-14
Anticipated expiration: 2037-02-08
Also published as: CN106951437B

Abstract

The present invention provides a recognition processing method and device suitable for multiple Chinese sensitive words and sentences. The method includes: obtaining multiple preset sensitive words and sentences; building a suffix tree from the sensitive words and sentences; obtaining Chinese text to be identified; matching the Chinese text to be identified against the suffix tree; and, if the match succeeds, obtaining the sensitive words and sentences in the Chinese text to be identified and outputting them for display. The method exploits the characteristics of Chinese to reduce the matching time of a pattern string on the suffix tree, which saves time and improves matching speed, and it is suitable for Chinese string matching against multiple sensitive words and sentences.

Description

Recognition processing method and device suitable for multiple Chinese sensitive words and sentences

Technical field

The present invention relates to the field of computer processing technology, and in particular to a recognition processing method and device suitable for multiple Chinese sensitive words and sentences.

Background

Identifying sensitive words and sentences means using a program to scan a text for specified keywords and to check whether the text violates a specified policy. It is the basis of sensitive-word filtering. Finding sensitive vocabulary quickly and accurately requires a pattern matching algorithm.

Pattern matching algorithms for pattern strings include the Aho-Corasick (AC) algorithm, the BM algorithm, and the ACBM algorithm. The AC algorithm preprocesses multiple pattern strings into a tree-shaped deterministic finite state automaton (DFSA), so that a single scan of the text string completes all pattern matching; the time complexity of matching is O(n+m). The BM algorithm cannot handle the multi-pattern matching problem. The ACBM algorithm combines the ideas of the AC and BM algorithms, and its average-case efficiency is better than that of the AC algorithm. Although the ACBM algorithm performs well in practice, it works poorly for Chinese and fails to make full use of the characteristics of the pattern strings and of Chinese text, so matching is slow.

The cause of this inefficiency is that the basic structural unit of English is the word, whereas the basic structural unit of Chinese is the character, so sensitive-word detection differs greatly between the two. For English, detection matches successively over 26 letters; for Chinese, it matches successively over tens of thousands of characters. When the alphabet of a string matching algorithm grows from 26 English letters to tens of thousands of Chinese characters, the algorithm can no longer achieve its expected performance in either time or space. In addition, Chinese characters are multi-byte symbols and have attributes that English letters lack, such as pinyin, which existing algorithms do not fully exploit.
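The alphabet-size gap described above can be made concrete with a few lines of Python. This is an illustrative sketch only: it takes the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) as a rough stand-in for "tens of thousands of characters" and assumes UTF-8 encoding, neither of which is stated in the source.

```python
import string

# Alphabet size for English matching: the 26 lowercase Latin letters.
latin_alphabet_size = len(string.ascii_lowercase)

# Assumption: the basic CJK Unified Ideographs block (U+4E00..U+9FFF)
# is used as a rough proxy for the Chinese character alphabet.
cjk_alphabet_size = 0x9FFF - 0x4E00 + 1

# Assumption: UTF-8 encoding. A Latin letter takes 1 byte,
# while a Chinese character in this block takes 3 bytes.
latin_bytes = len("a".encode("utf-8"))
chinese_bytes = len("国".encode("utf-8"))

print(latin_alphabet_size, cjk_alphabet_size)  # 26 20992
print(latin_bytes, chinese_bytes)              # 1 3
```

Any per-character table sized for such an alphabet grows accordingly, which is one plausible reason that matchers designed around 26 letters scale poorly to Chinese text.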

Summary of the invention

The present invention provides a recognition processing method and device suitable for multiple Chinese sensitive words and sentences, to solve the problem in the prior art that matching Chinese sensitive words and sentences is slow.

In a first aspect, the present invention provides a recognition processing method suitable for multiple Chinese sensitive words and sentences, including:

obtaining multiple preset sensitive words and sentences;

building a suffix tree from the sensitive words and sentences;

obtaining Chinese text to be identified;

matching the Chinese text to be identified against the suffix tree;

if the match succeeds, obtaining the sensitive words and sentences in the Chinese text to be identified and outputting them for display.

Optionally, building the suffix tree from the sensitive words and sentences includes:

S21, building a pattern string set P (P₁, P₂, P₃, P₄, P₅ ... P_n) from the multiple preset sensitive words and sentences;

S22, setting a root node whose attribute value is a first preset value, the first preset value being the arrangement value of any pinyin letter;

S23, selecting any sensitive word or sentence P_i from the pattern string set, the string length of P_i being m;

S24, obtaining the m-th character of P_i, parsing the m-th character to obtain the first letter of its pinyin, and obtaining the arrangement value of that first letter from the letter and a preset correspondence between pinyin letters and arrangement values;

S25, judging whether the arrangement value of the first letter is less than the first preset value; if it is, placing the node corresponding to the m-th character on the left side of the root node; otherwise, placing it on the right side of the root node;

S26, obtaining in turn the (m-1)-th, (m-2)-th, ..., 2nd and 1st characters of P_i and repeating steps S24-S25, so that the nodes corresponding to characters m-1, m-2, ..., 2, 1 are placed as child nodes of the nodes for characters m, m-1, ..., 2, respectively.
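The arrangement-value lookup of step S24 and the left/right decision of step S25 can be sketched as follows. This is a minimal sketch, not the patent's implementation: it assumes that the arrangement value of a pinyin letter is simply its position in the Latin alphabet (which agrees with the values b = 2, g = 7, q = 17, s = 19 used in the worked example of this document), and the function names `arrangement_value` and `side_of` are hypothetical.

```python
import string


def arrangement_value(letter: str) -> int:
    """Assumed mapping: position in the Latin alphabet, a -> 1 ... z -> 26."""
    return string.ascii_lowercase.index(letter.lower()) + 1


def side_of(parent_value: int, letter: str) -> str:
    """Left if the letter's value is less than the parent's value, else right."""
    return "left" if arrangement_value(letter) < parent_value else "right"


# The first preset value may be the arrangement value of any pinyin letter;
# "m" gives 13, the root value used in the worked example.
ROOT_VALUE = arrangement_value("m")

print(ROOT_VALUE)                # 13
print(side_of(ROOT_VALUE, "b"))  # left
print(side_of(ROOT_VALUE, "q"))  # right
```

Choosing a root value near the middle of the alphabet, as here, tends to split the first-level nodes more evenly between the two sides, although the text allows any letter's value.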

Optionally, matching the Chinese text to be identified against the suffix tree includes: matching the Chinese text to be identified using the BM algorithm on the suffix tree.

Optionally, the sensitive words and sentences include single characters, phrases and sentences.

Optionally, the method further includes: if the match does not succeed, sending a prompt message.

In a second aspect, the present invention provides a recognition processing device suitable for multiple Chinese sensitive words and sentences, including:

a first acquisition module, configured to obtain multiple preset sensitive words and sentences;

a processing module, configured to build a suffix tree from the sensitive words and sentences;

a second acquisition module, configured to obtain Chinese text to be identified;

a matching module, configured to match the Chinese text to be identified against the suffix tree;

a display module, configured to obtain, after the match succeeds, the sensitive words and sentences in the Chinese text to be identified and output them for display.

Optionally, the processing module is specifically configured to perform the suffix tree construction steps described above for the method.

Optionally, the matching module is specifically configured to: match the Chinese text to be identified using the BM algorithm on the suffix tree.

Optionally, the display module is further configured to: send a prompt message after the match does not succeed.

As can be seen from the above technical solution, the recognition processing method and device for multiple Chinese sensitive words and sentences of the present invention parse the obtained preset sensitive words and sentences and build a suffix tree using the arrangement values of pinyin letters. After the Chinese text to be identified is obtained, it is matched against the suffix tree, branching according to the arrangement value of each character's pinyin letter. When the match succeeds, the sensitive words and sentences in the Chinese text to be identified are obtained and output for display. By exploiting the characteristics of Chinese, the matching time of a pattern string on the suffix tree is reduced, which saves time and improves matching speed, and the approach is suitable for Chinese string matching against multiple sensitive words and sentences.

Brief description of the drawings

Fig. 1 is a schematic flowchart of the recognition processing method suitable for multiple Chinese sensitive words and sentences provided by Embodiment 1 of the present invention;

Fig. 2 is a structural diagram of the suffix tree provided by an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of the recognition processing device suitable for multiple Chinese sensitive words and sentences provided by Embodiment 2 of the present invention.

Detailed description of the embodiments

The embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following embodiments illustrate the present invention but do not limit its scope.

Fig. 1 shows a recognition processing method suitable for multiple Chinese sensitive words and sentences provided by Embodiment 1 of the present invention, including:

S11, obtaining multiple preset sensitive words and sentences.

In this step, it should be noted that, in the embodiments of the present invention, the sensitive words and sentences are preset in advance. They typically include single characters, phrases and sentences: for example, a single character meaning "stupid", phrases such as "wretch" and "violence", and sentences such as "I hate China".

S12, building a suffix tree from the sensitive words and sentences.

In this step, it should be noted that, in the embodiments of the present invention, a suffix tree is built so that sensitive words and sentences can later be matched in text. The construction follows the suffix tree construction steps set out in the summary above, the last of which is:

S26, obtaining in turn the (m-1)-th, (m-2)-th, ..., 2nd and 1st characters of P_i and repeating steps S24-S25, so that the nodes corresponding to characters m-1, m-2, ..., 2, 1 are placed under the nodes for characters m, m-1, ..., 2, respectively.

The above steps are explained with a specific example:

As shown in Fig. 2, assume the pattern string set is P (P₁, P₂, P₃, P₄), where P₁ is "笨" ("stupid"), P₂ is "色情" ("pornographic"), P₃ is "你爱法国" ("you love France"), and P₄ is "他讨厌法国" ("he hates France").

A root node is set, and the attribute value of the root node is 13.

Obtain sensitive word P₁, whose string length is 1. Parsing the character "笨" gives the first letter "b" of its pinyin, and the arrangement value "2" of that letter is obtained from the letter and the preset correspondence between pinyin letters and arrangement values. Since "2" is less than the root node's attribute value "13", the node for "笨" is placed on the left side of the root node as a child node of the root node.

Obtain sensitive word P₂, whose string length is 2. Parsing the 2nd character "情" gives the first letter "q" of its pinyin, with arrangement value "17". Since "17" is greater than the root node's attribute value "13", the node for "情" is placed on the right side of the root node as a child node of the root node. Parsing the 1st character "色" gives the first letter "s", with arrangement value "19". Since "19" is greater than the attribute value "17" of the "情" node, the node for "色" is placed on the right side of the "情" node as a child node of the "情" node.

Obtain sensitive sentence P₃, whose string length is 4. Parsing the 4th character "国" gives the first letter "g" of its pinyin, with arrangement value "7". Since "7" is less than the root node's attribute value "13", the node for "国" is placed on the left side of the root node as a child node of the root node. The same processing is applied in turn to "法", "爱" and "你"; it is not repeated here, and the result is shown in Fig. 2.

Obtain sensitive sentence P₄, whose string length is 5. Parsing the 5th character "国" gives the first letter "g" of its pinyin, with arrangement value "7". Since "7" is less than the root node's attribute value "13", the node for "国" is placed on the left side of the root node as a child node of the root node. The same processing is applied in turn to "法", "厌", "讨" and "他"; it is not repeated here, and the result is shown in Fig. 2.
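The worked example above can be reproduced with a short Python sketch. Several things here are assumptions rather than details from the source: the Chinese strings (笨, 色情, 你爱法国, 他讨厌法国) are reconstructed from the translated glosses and the pinyin initials given in the example; `PINYIN_INITIAL` is a hypothetical hand-written lookup standing in for a real pinyin dictionary; each node keeps separate `left` and `right` child maps; and patterns that share a suffix share nodes. The resulting layout may differ in detail from Fig. 2.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical lookup: first pinyin letter of each character in the example.
PINYIN_INITIAL = {"笨": "b", "色": "s", "情": "q", "你": "n", "爱": "a",
                  "法": "f", "国": "g", "他": "t", "讨": "t", "厌": "y"}


def value_of(ch: str) -> int:
    """Assumed arrangement value: alphabet position of the pinyin initial."""
    return ord(PINYIN_INITIAL[ch]) - ord("a") + 1


@dataclass
class Node:
    value: int                                   # arrangement value of this node
    left: Dict[str, "Node"] = field(default_factory=dict)
    right: Dict[str, "Node"] = field(default_factory=dict)
    pattern: Optional[str] = None                # set where a whole pattern ends


def insert(root: Node, pattern: str) -> None:
    """Insert a pattern from its last character to its first."""
    node = root
    for ch in reversed(pattern):
        v = value_of(ch)
        # Smaller value than the parent goes left, otherwise right.
        branch = node.left if v < node.value else node.right
        node = branch.setdefault(ch, Node(v))
    node.pattern = pattern


root = Node(13)  # root attribute value from the example
for p in ["笨", "色情", "你爱法国", "他讨厌法国"]:
    insert(root, p)

print(sorted(root.left), sorted(root.right))
```

With these assumptions, "笨" and "国" end up on the left of the root and "情" on the right, matching the placements stated in the example; "法" then sits to the left of "国", with "爱" on its left and "厌" on its right.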

S13, obtaining Chinese text to be identified.

In this step, it should be noted that, in the embodiments of the present invention, the Chinese text to be identified may be a published article, a comment, a message, or the like.

S14, matching the Chinese text to be identified against the suffix tree.

In this step, it should be noted that when the length of a string P_i is greater than the length of the text, P_i cannot possibly be found in the text. The character length of the Chinese text is therefore greater than the character length of the pattern strings, i.e. len(T) > maxlen(P_i).

Matching the Chinese text to be identified using the BM algorithm on the suffix tree may specifically include:

(1) Based on the length minlen(P_i) of the shortest pattern string, position minlen(P_i) of the target string is chosen as the starting match position, and BM matching is carried out using the tree.

(2) If a character comparison mismatches, two heuristic rules are applied: the bad-character rule and the good-suffix rule.

(3) If a character comparison matches, the character immediately to its left is first compared with the matched character, where the order is determined by the values assigned from the characters' pinyin. If the left character is smaller than the matched character, the search continues in the left child nodes; otherwise, it continues in the right child nodes.
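Steps (1) to (3) can be approximated by the following sketch. It is a simplification under stated assumptions, not the patent's matcher: only a bad-character shift in the style of Set Horspool is implemented and the good-suffix rule is omitted; `PINYIN_INITIAL` is a hypothetical lookup; characters absent from it are given value 0, so they always branch left and never match; the pattern list is assumed to be non-empty; and the Chinese strings are reconstructed from the glosses in the worked example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical lookup: first pinyin letter of each pattern character.
PINYIN_INITIAL = {"笨": "b", "色": "s", "情": "q", "你": "n", "爱": "a",
                  "法": "f", "国": "g", "他": "t", "讨": "t", "厌": "y"}


def value_of(ch: str) -> int:
    """Alphabet position of the pinyin initial; 0 for unknown characters."""
    letter = PINYIN_INITIAL.get(ch)
    return ord(letter) - ord("a") + 1 if letter else 0


@dataclass
class Node:
    value: int
    left: Dict[str, "Node"] = field(default_factory=dict)
    right: Dict[str, "Node"] = field(default_factory=dict)
    pattern: Optional[str] = None


def build(patterns: List[str], root_value: int = 13) -> Node:
    root = Node(root_value)
    for p in patterns:
        node = root
        for ch in reversed(p):  # last character hangs off the root
            v = value_of(ch)
            branch = node.left if v < node.value else node.right
            node = branch.setdefault(ch, Node(v))
        node.pattern = p
    return root


def find_all(text: str, patterns: List[str]) -> List[Tuple[int, str]]:
    root = build(patterns)
    min_len = min(len(p) for p in patterns)
    # Bad-character table: distance from a character to the pattern end,
    # capped at the shortest pattern length so no match can be skipped.
    shift: Dict[str, int] = {}
    for p in patterns:
        for k, ch in enumerate(p[:-1]):
            shift[ch] = min(shift.get(ch, min_len), len(p) - 1 - k)
    hits = []
    pos = min_len - 1  # step (1): start at the shortest pattern length
    while pos < len(text):
        node, i = root, pos
        while i >= 0:  # step (3): walk leftwards, branching by pinyin value
            v = value_of(text[i])
            branch = node.left if v < node.value else node.right
            node = branch.get(text[i])
            if node is None:
                break
            if node.pattern is not None:
                hits.append((i, node.pattern))
            i -= 1
        pos += shift.get(text[pos], min_len)  # step (2), bad character only
    return hits


print(find_all("他说你爱法国的色情", ["色情", "你爱法国"]))
```

Each hit is reported as a pair of the start index in the text and the matched pattern. Because the left/right split only narrows which child map is consulted, the branching changes the lookup cost per node but not the set of matches found.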

S15, if the match succeeds, obtaining the sensitive words and sentences in the Chinese text to be identified and outputting them for display.

In addition, if the match does not succeed, a prompt message may be sent to inform the user that the Chinese text can be published.

The recognition processing method suitable for multiple Chinese sensitive words and sentences provided by Embodiment 1 of the present invention parses the obtained preset sensitive words and sentences and builds a suffix tree using the arrangement values of pinyin letters. After the Chinese text to be identified is obtained, it is matched against the suffix tree, branching according to the arrangement value of each character's pinyin letter. When the match succeeds, the sensitive words and sentences in the Chinese text to be identified are obtained and output for display. By exploiting the characteristics of Chinese, the matching time of a pattern string on the suffix tree is reduced, which saves time and improves matching speed, and the method is suitable for Chinese string matching against multiple sensitive words and sentences.

Fig. 3 shows a recognition processing device suitable for multiple Chinese sensitive words and sentences provided by Embodiment 2 of the present invention, including a first acquisition module 21, a processing module 22, a second acquisition module 23, a matching module 24 and a display module 25, wherein:

the first acquisition module 21 is configured to obtain multiple preset sensitive words and sentences;

the processing module 22 is configured to build a suffix tree from the sensitive words and sentences;

the second acquisition module 23 is configured to obtain Chinese text to be identified;

the matching module 24 is configured to match the Chinese text to be identified against the suffix tree;

the display module 25 is configured to obtain, after the match succeeds, the sensitive words and sentences in the Chinese text to be identified and output them for display.

The processing module is specifically configured to perform the suffix tree construction steps described above for the method.

Because the device of Embodiment 2 of the present invention works on the same principle as the method of the above embodiment, a more detailed explanation is not repeated here.

It should be noted that, in the embodiments of the present invention, the related functional modules may be implemented by a hardware processor.

The recognition processing device suitable for multiple Chinese sensitive words and sentences provided by Embodiment 2 of the present invention parses the obtained preset sensitive words and sentences and builds a suffix tree using the arrangement values of pinyin letters. After the Chinese text to be identified is obtained, it is matched against the suffix tree, branching according to the arrangement value of each character's pinyin letter. When the match succeeds, the sensitive words and sentences in the Chinese text to be identified are obtained and output for display. By exploiting the characteristics of Chinese, the matching time of a pattern string on the suffix tree is reduced, which saves time and improves matching speed, and the device is suitable for Chinese string matching against multiple sensitive words and sentences.

In addition, those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the present invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second and third does not indicate any order; these words may be interpreted as names.

Those of ordinary skill in the art will understand that the above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope defined by the claims of the present invention.

Claims

1. A recognition processing method suitable for multiple Chinese sensitive words and sentences, characterized by including:

obtaining multiple preset sensitive words and sentences;

building a suffix tree from the sensitive words and sentences;

obtaining Chinese text to be identified;

matching the Chinese text to be identified against the suffix tree;

2. The method according to claim 1, characterized in that building the suffix tree from the sensitive words and sentences includes:

S22, setting a root node whose attribute value is a first preset value, the first preset value being the arrangement value of any pinyin letter;

S24, obtaining the m-th character of the sensitive word or sentence P_i, parsing the m-th character to obtain the first letter of its pinyin, and obtaining the arrangement value of that first letter from the letter and a preset correspondence between pinyin letters and arrangement values;

S25, judging whether the arrangement value of the first letter is less than the first preset value; if it is, placing the node corresponding to the m-th character on the left side of the root node; otherwise, placing it on the right side of the root node;

S26, obtaining in turn the (m-1)-th, (m-2)-th, ..., 2nd and 1st characters of P_i and repeating steps S24-S25, so that the nodes corresponding to characters m-1, m-2, ..., 2, 1 are placed as child nodes of the nodes for characters m, m-1, ..., 2, respectively.

3. The method according to claim 1, characterized in that matching the Chinese text to be identified against the suffix tree includes: matching the Chinese text to be identified using the BM algorithm on the suffix tree.

4. The method according to claim 1, characterized in that the sensitive words and sentences include single characters, phrases and sentences.

5. The method according to claim 1, characterized in that, if the match does not succeed, a prompt message is sent.

6. A recognition processing device suitable for multiple Chinese sensitive words and sentences, characterized by including:

a second acquisition module, configured to obtain Chinese text to be identified;

7. The device according to claim 6, characterized in that the processing module is specifically configured to:

8. The device according to claim 6, characterized in that the matching module is specifically configured to: match the Chinese text to be identified using the BM algorithm on the suffix tree.

9. The device according to claim 6, characterized in that the sensitive words and sentences include single characters, phrases and sentences.

10. The device according to claim 6, characterized in that the display module is further configured to: send a prompt message after the match does not succeed.