CN106951437A - Identifying processing method and device suitable for the sensitive words and phrases of multiple Chinese - Google Patents


Info

Publication number
CN106951437A
Authority
CN
China
Prior art keywords
phrases
sensitive words
character
chinese
identified
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710072161.5A
Other languages
Chinese (zh)
Other versions
CN106951437B (en)
Inventor
喻民
刘超
卢越
李敏
姜建国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS
Priority to CN201710072161.5A
Publication of CN106951437A
Application granted
Publication of CN106951437B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Abstract

The present invention provides a recognition processing method and device for multiple Chinese sensitive words and phrases. The method includes: obtaining multiple preset sensitive words and phrases; building a suffix tree from the sensitive words and phrases; obtaining a Chinese text to be identified; matching the Chinese text to be identified against the suffix tree; and, if the match succeeds, obtaining the sensitive words and phrases in the Chinese text to be identified and displaying them as output. By exploiting the characteristics of Chinese, the method improves the matching time of a pattern string on the suffix tree from [expression missing] to [expression missing], which saves time and increases the matching speed of pattern strings on the suffix tree. It is suitable for Chinese string matching with multiple sensitive words and phrases.

Description

Recognition processing method and device for multiple Chinese sensitive words and phrases
Technical field
The present invention relates to the field of computer processing technology, and in particular to a recognition processing method and device for multiple Chinese sensitive words and phrases.
Background art
Recognizing sensitive words and phrases means using a program to scan information text for specified keywords and to check whether the text violates a specified policy. It is the basis of sensitive-word filtering. Finding sensitive words quickly and accurately requires pattern matching algorithms.
Pattern matching algorithms for pattern strings include the Aho-Corasick (AC) algorithm, the BM algorithm and the ACBM algorithm. The AC algorithm preprocesses multiple pattern strings into a tree-shaped deterministic finite state automaton (DFSA), so that a single scan of the text string completes all pattern matching; the time complexity of matching is O(n+m). The time complexity of the BM algorithm is [expression missing], but it cannot handle the multi-pattern matching problem. The ACBM algorithm combines the ideas of the AC and BM algorithms; its average-case efficiency is better than that of the AC algorithm, and its time complexity is [expression missing]. Although the ACBM algorithm performs well in practice, it works poorly on Chinese and does not make full use of the characteristics of the pattern strings and of Chinese information, so matching is slow.
The inefficiency arises because the basic structural unit of English is the word, whereas the basic structural unit of Chinese is the character, and the two differ greatly when sensitive words and phrases are detected. For English, sensitive word detection matches against 26 letters in turn; for Chinese, it matches against tens of thousands of characters in turn. Once the alphabet grows from 26 letters to tens of thousands of Chinese characters, string matching algorithms can no longer reach their expected performance in either time or space. In addition, Chinese characters are multi-byte symbols and have attributes that English letters lack, such as pinyin, and these attributes have not been fully exploited in existing algorithms.
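The alphabet-size issue can be made concrete with a small sketch. The code below is not taken from the patent: it uses the simplified Horspool variant of the BM bad-character idea for a single pattern, and the function name `horspool_find_all` and the sample strings are illustrative assumptions. Because the Chinese alphabet is far too large for a dense per-character table, the shift table is stored sparsely in a dictionary.

```python
def horspool_find_all(text: str, pattern: str) -> list[int]:
    """Return the start index of every occurrence of pattern in text.

    Simplified Horspool variant of the BM bad-character rule (single pattern).
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Sparse bad-character table: only characters that occur in the pattern
    # (excluding its last position) get an entry. A dense table indexed by
    # character would need tens of thousands of slots for Chinese.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift according to the text character aligned with the pattern's
        # last position; characters absent from the table shift by m.
        pos += shift.get(text[pos + m - 1], m)
    return hits


# Python strings are sequences of code points, so indices count characters,
# while the encoded form of each Chinese character takes several bytes.
print(horspool_find_all("他讨厌法国，你爱法国", "法国"))
print(len("国".encode("utf-8")), len("国".encode("gbk")))
```

The sketch handles one pattern only, which mirrors the limitation of BM noted above; it says nothing about how the patent's own matching is implemented.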
Summary of the invention
The present invention provides a recognition processing method and device for multiple Chinese sensitive words and phrases, to solve the problem in the prior art that matching Chinese sensitive words and phrases is slow.
In a first aspect, the present invention provides a recognition processing method for multiple Chinese sensitive words and phrases, including:
obtaining multiple preset sensitive words and phrases;
building a suffix tree from the sensitive words and phrases;
obtaining a Chinese text to be identified;
matching the Chinese text to be identified against the suffix tree;
if the match succeeds, obtaining the sensitive words and phrases in the Chinese text to be identified and displaying them as output.
Optionally, building the suffix tree from the sensitive words and phrases includes:
S21: construct a pattern string set P (P1, P2, P3, P4, P5 ... Pn) from the multiple preset sensitive words and phrases;
S22: set a root node whose property value is a first preset value, the first preset value being the arrangement value of any pinyin letter;
S23: select any sensitive word or phrase Pi from the pattern string set, the string length of Pi being m;
S24: obtain the m-th character of Pi, parse the m-th character to obtain the initial letter of its pinyin, and obtain the arrangement value of that initial letter from a preset correspondence between pinyin letters and arrangement values;
S25: determine whether the arrangement value of the initial letter is less than the first preset value; if it is, place the node for the m-th character on the left side of the root node; otherwise, place it on the right side of the root node;
S26: obtain in turn the (m-1)-th, (m-2)-th, ..., 2nd and 1st characters of Pi, and repeat steps S24-S25 to place the nodes for the (m-1)-th, (m-2)-th, ..., 2nd and 1st characters as child nodes of the nodes for the m-th, (m-1)-th, ..., 2nd characters, respectively.
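Steps S24 and S25 rest on a lookup from a pinyin initial letter to its arrangement value, followed by a left/right decision. A minimal sketch follows, assuming the correspondence is simply a=1 through z=26; that assumption agrees with the values quoted in the worked example of the description (b is 2, g is 7, q is 17, s is 19, root value 13) but is not spelled out in the text. The names `ARRANGEMENT`, `arrangement_value` and `side_of` are hypothetical.

```python
import string

# Assumed correspondence between pinyin letters and arrangement values:
# a=1, b=2, ..., z=26.
ARRANGEMENT = {letter: i for i, letter in enumerate(string.ascii_lowercase, start=1)}


def arrangement_value(pinyin: str) -> int:
    """Arrangement value of the initial letter of a pinyin syllable (S24)."""
    return ARRANGEMENT[pinyin[0].lower()]


def side_of(parent_value: int, pinyin: str) -> str:
    """Decide where a node goes relative to its parent (S25).

    Strictly smaller values go left; equal or larger values go right.
    """
    return "left" if arrangement_value(pinyin) < parent_value else "right"


print(arrangement_value("ben"), side_of(13, "ben"))
print(arrangement_value("qing"), side_of(13, "qing"))
```

With a root value of 13, which is the arrangement value of "m" under this assumption, the alphabet splits into two halves of 12 and 14 letters.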
Optionally, matching the Chinese text to be identified against the suffix tree includes: matching the Chinese text to be identified on the suffix tree using the BM algorithm.
Optionally, the sensitive words and phrases include single characters, phrases and sentences.
Optionally, the method further includes: if the match fails, sending a prompt message.
In a second aspect, the present invention provides a recognition processing device for multiple Chinese sensitive words and phrases, including:
a first acquisition module, for obtaining multiple preset sensitive words and phrases;
a processing module, for building a suffix tree from the sensitive words and phrases;
a second acquisition module, for obtaining a Chinese text to be identified;
a matching module, for matching the Chinese text to be identified against the suffix tree;
a display module, for obtaining, after the match succeeds, the sensitive words and phrases in the Chinese text to be identified and displaying them as output.
Optionally, the processing module is specifically configured to:
S21: construct a pattern string set P (P1, P2, P3, P4, P5 ... Pn) from the multiple preset sensitive words and phrases;
S22: set a root node whose property value is a first preset value, the first preset value being the arrangement value of any pinyin letter;
S23: select any sensitive word or phrase Pi from the pattern string set, the string length of Pi being m;
S24: obtain the m-th character of Pi, parse the m-th character to obtain the initial letter of its pinyin, and obtain the arrangement value of that initial letter from a preset correspondence between pinyin letters and arrangement values;
S25: determine whether the arrangement value of the initial letter is less than the first preset value; if it is, place the node for the m-th character on the left side of the root node; otherwise, place it on the right side of the root node;
S26: obtain in turn the (m-1)-th, (m-2)-th, ..., 2nd and 1st characters of Pi, and repeat steps S24-S25 to place the nodes for the (m-1)-th, (m-2)-th, ..., 2nd and 1st characters as child nodes of the nodes for the m-th, (m-1)-th, ..., 2nd characters, respectively.
Optionally, the matching module is specifically configured to: match the Chinese text to be identified on the suffix tree using the BM algorithm.
Optionally, the sensitive words and phrases include single characters, phrases and sentences.
Optionally, the display module is further configured to: send a prompt message after the match fails.
As can be seen from the above technical solution, the recognition processing method and device for multiple Chinese sensitive words and phrases of the present invention parse the multiple preset sensitive words and phrases that are obtained and build a suffix tree using pinyin letter arrangement values. After the Chinese text to be identified is obtained, it is matched against the suffix tree, branching according to the arrangement value of each character's pinyin initial letter. When the match succeeds, the sensitive words and phrases in the Chinese text to be identified are obtained and displayed as output. By exploiting the characteristics of Chinese, the matching time of a pattern string on the suffix tree is improved from [expression missing] to [expression missing], which saves time and increases the matching speed of pattern strings on the suffix tree. The invention is suitable for Chinese string matching with multiple sensitive words and phrases.
Brief description of the drawings
Fig. 1 is a flow chart of the recognition processing method for multiple Chinese sensitive words and phrases provided by Embodiment 1 of the present invention;
Fig. 2 is a structural diagram of the suffix tree provided by an embodiment of the present invention;
Fig. 3 is a structural diagram of the recognition processing device for multiple Chinese sensitive words and phrases provided by Embodiment 2 of the present invention.
Detailed description of the embodiments
Specific embodiments of the present invention are described in further detail below with reference to the drawings and embodiments. The following embodiments illustrate the present invention but do not limit its scope.
Fig. 1 shows a recognition processing method for multiple Chinese sensitive words and phrases provided by Embodiment 1 of the present invention, including:
S11: obtain multiple preset sensitive words and phrases.
In this step, it should be noted that in the embodiments of the present invention the sensitive words and phrases are preset in advance. They may typically include single characters, phrases and sentences: a single character such as "stupid"; phrases such as "wretch" and "violence"; a sentence such as "I hate China".
S12: build a suffix tree from the sensitive words and phrases.
In this step, it should be noted that in the embodiments of the present invention a suffix tree needs to be built so that sensitive words and phrases can later be matched in text information. The details are as follows:
S21: construct a pattern string set P (P1, P2, P3, P4, P5 ... Pn) from the multiple preset sensitive words and phrases;
S22: set a root node whose property value is a first preset value, the first preset value being the arrangement value of any pinyin letter;
S23: select any sensitive word or phrase Pi from the pattern string set, the string length of Pi being m;
S24: obtain the m-th character of Pi, parse the m-th character to obtain the initial letter of its pinyin, and obtain the arrangement value of that initial letter from a preset correspondence between pinyin letters and arrangement values;
S25: determine whether the arrangement value of the initial letter is less than the first preset value; if it is, place the node for the m-th character on the left side of the root node; otherwise, place it on the right side of the root node;
S26: obtain in turn the (m-1)-th, (m-2)-th, ..., 2nd and 1st characters of Pi, and repeat steps S24-S25 to place the nodes for the (m-1)-th, (m-2)-th, ..., 2nd and 1st characters as child nodes of the nodes for the m-th, (m-1)-th, ..., 2nd characters, respectively.
The above steps are explained with a specific example:
As shown in Fig. 2, assume the pattern string set is P (P1, P2, P3, P4), where P1 is "笨" ("stupid"), P2 is "色情" ("pornographic"), P3 is "你爱法国" ("you love France") and P4 is "他讨厌法国" ("he hates France").
A root node is set, and the property value of the root node is 13.
Obtain sensitive word P1, whose string length is 1. Parsing the character "笨" gives the initial letter "b" of its pinyin, and the preset correspondence between pinyin letters and arrangement values gives the arrangement value "2" for that letter. Since "2" is less than the property value "13" of the root node, the node for "笨" is placed on the left side of the root node as a child node of the root node.
Obtain sensitive word P2, whose string length is 2. Parsing the 2nd character "情" gives the initial letter "q" of its pinyin, and the preset correspondence gives the arrangement value "17". Since "17" is greater than the property value "13" of the root node, the node for "情" is placed on the right side of the root node as a child node of the root node. Parsing the 1st character "色" gives the initial letter "s" and the arrangement value "19". Since "19" is greater than the property value "17" of the "情" node, the node for "色" is placed on the right side of the "情" node as a child node of the "情" node.
Obtain sensitive phrase P3, whose string length is 4. Parsing the 4th character "国" gives the initial letter "g" of its pinyin, and the preset correspondence gives the arrangement value "7". Since "7" is less than the property value "13" of the root node, the node for "国" is placed on the left side of the root node as a child node of the root node. The same processing is applied in turn to "法", "爱" and "你" and is not repeated here; see Fig. 2.
Obtain sensitive phrase P4, whose string length is 5. Parsing the 5th character "国" gives the initial letter "g" of its pinyin, and the preset correspondence gives the arrangement value "7". Since "7" is less than the property value "13" of the root node, the node for "国" is placed on the left side of the root node as a child node of the root node. The same processing is applied in turn to "法", "厌", "讨" and "他" and is not repeated here; see Fig. 2.
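The construction walked through above can be sketched as follows. This is one possible reading of steps S21 to S26, not the patent's implementation. It assumes the four example patterns are the Chinese strings 笨, 色情, 你爱法国 and 他讨厌法国 (reconstructed from the translated glosses and the initial letters quoted above), that arrangement values run a=1 to z=26, and that each node keeps its children in two groups, left and right. The `INITIAL` table, `Node` and `build_tree` are hypothetical names; a real system would need a full character-to-pinyin dictionary.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical pinyin-initial table covering only the example characters.
INITIAL = {"笨": "b", "色": "s", "情": "q", "你": "n", "爱": "a",
           "法": "f", "国": "g", "他": "t", "讨": "t", "厌": "y"}


def value_of(ch: str) -> int:
    """Arrangement value of the character's pinyin initial (assumed a=1 ... z=26)."""
    return ord(INITIAL[ch]) - ord("a") + 1


@dataclass
class Node:
    char: str
    value: int
    left: dict = field(default_factory=dict)    # children with value < self.value
    right: dict = field(default_factory=dict)   # children with value >= self.value
    pattern: Optional[str] = None               # set where a whole pattern ends


def build_tree(patterns, root_value=13):
    """Insert each pattern from its last character to its first (S23 to S26)."""
    root = Node("", root_value)
    for p in patterns:
        node = root
        for ch in reversed(p):
            v = value_of(ch)
            group = node.left if v < node.value else node.right
            node = group.setdefault(ch, Node(ch, v))  # shared suffixes reuse nodes
        node.pattern = p
    return root


root = build_tree(["笨", "色情", "你爱法国", "他讨厌法国"])
print(sorted(root.left), sorted(root.right))
```

Under these assumptions "你爱法国" and "他讨厌法国" share the nodes for "国" and "法", and the paths diverge at the third character from the end. Whether this matches Fig. 2 in every detail cannot be confirmed from the text alone.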
S13: obtain a Chinese text to be identified.
In this step, it should be noted that in the embodiments of the present invention the Chinese text to be identified may be a published article, a comment, a message, and so on.
S14: match the Chinese text to be identified against the suffix tree.
In this step, it should be noted that when a string Pi is longer than the text, Pi cannot be found in the text. Therefore the character length of the Chinese text must be greater than the character length of the pattern strings, that is, len(T) > maxlen(Pi).
Matching the Chinese text to be identified on the suffix tree using the BM algorithm may specifically include:
(1) Based on the length minlen(Pi) of the shortest pattern string, select position minlen(Pi) of the target string as the starting match position, and perform BM matching using the tree.
(2) If a character comparison results in a mismatch, apply the two heuristic rules, namely the bad-character rule and the good-suffix rule.
(3) If a character comparison results in a match, first compare the character immediately to the left with the matched character, where the comparison is based on the values assigned from the characters' pinyin. If the character to the left is smaller than the matched character, search among the left child nodes; otherwise, search among the right child nodes.
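Steps (1) and (3) above can be sketched as a right-to-left walk over the tree. The sketch below is a simplification under stated assumptions: it always moves to the next alignment by one position, so the bad-character and good-suffix shifts of step (2) are not implemented; the pinyin table `INITIAL` is a hypothetical stub in which unknown characters get the value 0; and the function names are illustrative rather than taken from the patent.

```python
# Hypothetical pinyin-initial table; a real system would need a full
# character-to-pinyin dictionary.
INITIAL = {"色": "s", "情": "q", "你": "n", "爱": "a", "法": "f", "国": "g"}


def value_of(ch):
    """Arrangement value of the pinyin initial (assumed a=1 ... z=26); 0 if unknown."""
    letter = INITIAL.get(ch)
    return ord(letter) - ord("a") + 1 if letter else 0


def new_node(value):
    return {"value": value, "left": {}, "right": {}, "pattern": None}


def build_tree(patterns, root_value=13):
    """Insert every pattern from its last character to its first."""
    root = new_node(root_value)
    for p in patterns:
        node = root
        for ch in reversed(p):
            v = value_of(ch)
            side = "left" if v < node["value"] else "right"
            node = node[side].setdefault(ch, new_node(v))
        node["pattern"] = p
    return root


def find_sensitive(text, patterns):
    """Return (start_index, pattern) for every occurrence found in text."""
    if not patterns:
        return []
    root = build_tree(patterns)
    first_end = min(len(p) for p in patterns) - 1   # step (1), as a 0-based index
    hits = []
    for end in range(first_end, len(text)):         # naive shift of one position
        node, i = root, end
        while i >= 0:
            ch = text[i]
            # Step (3): compare arrangement values to pick the child group.
            side = "left" if value_of(ch) < node["value"] else "right"
            node = node[side].get(ch)
            if node is None:                         # mismatch at this alignment
                break
            if node["pattern"] is not None:
                hits.append((i, node["pattern"]))
            i -= 1
    return hits


print(find_sensitive("我说你爱法国和色情", ["色情", "你爱法国"]))
```

The left/right comparison halves the set of candidate children examined at each node. Adding the BM shift rules on top of this walk is left out here because the text does not say how the shifts are computed for a set of patterns.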
S15: if the match succeeds, obtain the sensitive words and phrases in the Chinese text to be identified and display them as output.
In addition, if the match fails, a prompt message can be sent to indicate that the Chinese text can be published.
The recognition processing method for multiple Chinese sensitive words and phrases provided by Embodiment 1 of the present invention parses the multiple preset sensitive words and phrases that are obtained and builds a suffix tree using pinyin letter arrangement values. After the Chinese text to be identified is obtained, it is matched against the suffix tree, branching according to the arrangement value of each character's pinyin initial letter. After the match succeeds, the sensitive words and phrases in the Chinese text to be identified are obtained and displayed as output. By exploiting the characteristics of Chinese, the matching time of a pattern string on the suffix tree is improved from [expression missing] to [expression missing], which saves time and increases the matching speed of pattern strings on the suffix tree. The method is suitable for Chinese string matching with multiple sensitive words and phrases.
Fig. 3 shows a recognition processing device for multiple Chinese sensitive words and phrases provided by Embodiment 2 of the present invention, including a first acquisition module 21, a processing module 22, a second acquisition module 23, a matching module 24 and a display module 25, wherein:
the first acquisition module 21 is for obtaining multiple preset sensitive words and phrases;
the processing module 22 is for building a suffix tree from the sensitive words and phrases;
the second acquisition module 23 is for obtaining a Chinese text to be identified;
the matching module 24 is for matching the Chinese text to be identified against the suffix tree;
the display module 25 is for obtaining, after the match succeeds, the sensitive words and phrases in the Chinese text to be identified and displaying them as output.
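How the five modules fit together can be sketched as a small pipeline. The code is illustrative only: `naive_build` and `naive_match` are placeholder stand-ins (de-duplication and plain substring tests) rather than the suffix tree and tree-based matching described in this document, and the function names and message strings are assumptions.

```python
def naive_build(patterns):
    """Stand-in for processing module 22: only de-duplicates and sorts the patterns."""
    return sorted(set(patterns))


def naive_match(index, text):
    """Stand-in for matching module 24: plain substring tests, no tree."""
    return [p for p in index if p in text]


def recognize(patterns, text, build=naive_build, match=naive_match):
    """Run the flow of modules 21 to 25 on already-acquired inputs.

    `patterns` plays the role of the output of first acquisition module 21,
    and `text` the output of second acquisition module 23.
    """
    index = build(patterns)      # processing module 22
    hits = match(index, text)    # matching module 24
    if hits:                     # display module 25: show what was found
        return "Sensitive words and phrases found: " + ", ".join(hits)
    # Prompt sent when nothing matches.
    return "No sensitive words and phrases found; the text can be published."


print(recognize(["violence", "wretch"], "no violence here"))
print(recognize(["violence"], "hello"))
```

Passing `build` and `match` as parameters keeps the flow separate from the matching technique, so a tree-based builder and matcher could be swapped in without changing `recognize`.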
The processing module is specifically configured to:
S21: construct a pattern string set P (P1, P2, P3, P4, P5 ... Pn) from the multiple preset sensitive words and phrases;
S22: set a root node whose property value is a first preset value, the first preset value being the arrangement value of any pinyin letter;
S23: select any sensitive word or phrase Pi from the pattern string set, the string length of Pi being m;
S24: obtain the m-th character of Pi, parse the m-th character to obtain the initial letter of its pinyin, and obtain the arrangement value of that initial letter from a preset correspondence between pinyin letters and arrangement values;
S25: determine whether the arrangement value of the initial letter is less than the first preset value; if it is, place the node for the m-th character on the left side of the root node; otherwise, place it on the right side of the root node;
S26: obtain in turn the (m-1)-th, (m-2)-th, ..., 2nd and 1st characters of Pi, and repeat steps S24-S25 to place the nodes for the (m-1)-th, (m-2)-th, ..., 2nd and 1st characters as child nodes of the nodes for the m-th, (m-1)-th, ..., 2nd characters, respectively.
Because the device of Embodiment 2 works on the same principle as the method of the above embodiment, further details are not repeated here.
It should be noted that in the embodiments of the present invention the relevant functional modules can be implemented by a hardware processor.
The recognition processing device for multiple Chinese sensitive words and phrases provided by Embodiment 2 of the present invention parses the multiple preset sensitive words and phrases that are obtained and builds a suffix tree using pinyin letter arrangement values. After the Chinese text to be identified is obtained, it is matched against the suffix tree, branching according to the arrangement value of each character's pinyin initial letter. After the match succeeds, the sensitive words and phrases in the Chinese text to be identified are obtained and displayed as output. By exploiting the characteristics of Chinese, the matching time of a pattern string on the suffix tree is improved from [expression missing] to [expression missing], which saves time and increases the matching speed of pattern strings on the suffix tree. The device is suitable for Chinese string matching with multiple sensitive words and phrases.
In addition, those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the present invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices can be embodied by one and the same item of hardware. The use of the words first, second and third does not indicate any order; these words can be interpreted as names.
Those of ordinary skill in the art will understand that the above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope defined by the claims of the present invention.

Claims (10)

1. A recognition processing method for multiple Chinese sensitive words and phrases, characterized by including:
obtaining multiple preset sensitive words and phrases;
building a suffix tree from the sensitive words and phrases;
obtaining a Chinese text to be identified;
matching the Chinese text to be identified against the suffix tree;
if the match succeeds, obtaining the sensitive words and phrases in the Chinese text to be identified and displaying them as output.
2. The method according to claim 1, characterized in that building the suffix tree from the sensitive words and phrases includes:
S21: construct a pattern string set P (P1, P2, P3, P4, P5 ... Pn) from the multiple preset sensitive words and phrases;
S22: set a root node whose property value is a first preset value, the first preset value being the arrangement value of any pinyin letter;
S23: select any sensitive word or phrase Pi from the pattern string set, the string length of Pi being m;
S24: obtain the m-th character of Pi, parse the m-th character to obtain the initial letter of its pinyin, and obtain the arrangement value of that initial letter from a preset correspondence between pinyin letters and arrangement values;
S25: determine whether the arrangement value of the initial letter is less than the first preset value; if it is, place the node for the m-th character on the left side of the root node; otherwise, place it on the right side of the root node;
S26: obtain in turn the (m-1)-th, (m-2)-th, ..., 2nd and 1st characters of Pi, and repeat steps S24-S25 to place the nodes for the (m-1)-th, (m-2)-th, ..., 2nd and 1st characters as child nodes of the nodes for the m-th, (m-1)-th, ..., 2nd characters, respectively.
3. The method according to claim 1, characterized in that matching the Chinese text to be identified against the suffix tree includes: matching the Chinese text to be identified on the suffix tree using the BM algorithm.
4. The method according to claim 1, characterized in that the sensitive words and phrases include single characters, phrases and sentences.
5. The method according to claim 1, characterized in that, if the match fails, a prompt message is sent.
6. A recognition processing device for multiple Chinese sensitive words and phrases, characterized by including:
a first acquisition module, for obtaining multiple preset sensitive words and phrases;
a processing module, for building a suffix tree from the sensitive words and phrases;
a second acquisition module, for obtaining a Chinese text to be identified;
a matching module, for matching the Chinese text to be identified against the suffix tree;
a display module, for obtaining, after the match succeeds, the sensitive words and phrases in the Chinese text to be identified and displaying them as output.
7. The device according to claim 6, characterized in that the processing module is specifically configured to:
S21: construct a pattern string set P (P1, P2, P3, P4, P5 ... Pn) from the multiple preset sensitive words and phrases;
S22: set a root node whose property value is a first preset value, the first preset value being the arrangement value of any pinyin letter;
S23: select any sensitive word or phrase Pi from the pattern string set, the string length of Pi being m;
S24: obtain the m-th character of Pi, parse the m-th character to obtain the initial letter of its pinyin, and obtain the arrangement value of that initial letter from a preset correspondence between pinyin letters and arrangement values;
S25: determine whether the arrangement value of the initial letter is less than the first preset value; if it is, place the node for the m-th character on the left side of the root node; otherwise, place it on the right side of the root node;
S26: obtain in turn the (m-1)-th, (m-2)-th, ..., 2nd and 1st characters of Pi, and repeat steps S24-S25 to place the nodes for the (m-1)-th, (m-2)-th, ..., 2nd and 1st characters as child nodes of the nodes for the m-th, (m-1)-th, ..., 2nd characters, respectively.
8. The device according to claim 6, characterized in that the matching module is specifically configured to: match the Chinese text to be identified on the suffix tree using the BM algorithm.
9. The device according to claim 6, characterized in that the sensitive words and phrases include single characters, phrases and sentences.
10. The device according to claim 6, characterized in that the display module is further configured to: send a prompt message after the match fails.
CN201710072161.5A 2017-02-08 2017-02-08 Identifying processing method and device suitable for the sensitive words and phrases of multiple Chinese Active CN106951437B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710072161.5A CN106951437B (en) 2017-02-08 2017-02-08 Identifying processing method and device suitable for the sensitive words and phrases of multiple Chinese

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710072161.5A CN106951437B (en) 2017-02-08 2017-02-08 Identifying processing method and device suitable for the sensitive words and phrases of multiple Chinese

Publications (2)

Publication Number Publication Date
CN106951437A (en) 2017-07-14
CN106951437B (en) 2019-11-01

Family

ID=59465486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710072161.5A Active CN106951437B (en) 2017-02-08 2017-02-08 Identifying processing method and device suitable for the sensitive words and phrases of multiple Chinese

Country Status (1)

Country Link
CN (1) CN106951437B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103514238A (en) * 2012-06-30 2014-01-15 重庆新媒农信科技有限公司 Sensitive word recognition processing method based on classification searching
US20150100304A1 (en) * 2013-10-07 2015-04-09 Xerox Corporation Incremental computation of repeats
CN105843950A (en) * 2016-04-12 2016-08-10 乐视控股(北京)有限公司 Sensitive word filtering method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LJSSPACE: "后缀树(Suffix Tree)的文本匹配算法", 《HTTPS://BLOG.CSDN.NET/LJSSPACE/ARTICLE/DETAILS/6571467》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062199A (en) * 2019-11-05 2020-04-24 北京中科微澜科技有限公司 Bad information identification method and device
CN111062199B (en) * 2019-11-05 2023-12-22 北京中科微澜科技有限公司 Bad information identification method and device
CN111159990A (en) * 2019-12-06 2020-05-15 国家计算机网络与信息安全管理中心 Method and system for recognizing general special words based on mode expansion
CN111159990B (en) * 2019-12-06 2022-09-30 国家计算机网络与信息安全管理中心 Method and system for identifying general special words based on pattern expansion
CN111831785A (en) * 2020-07-16 2020-10-27 平安科技(深圳)有限公司 Sensitive word detection method and device, computer equipment and storage medium
CN113157904A (en) * 2021-03-30 2021-07-23 北京优医达智慧健康科技有限公司 Sensitive word filtering method and system based on DFA algorithm
CN113157904B (en) * 2021-03-30 2024-02-09 北京优医达智慧健康科技有限公司 Sensitive word filtering method and system based on DFA algorithm

Also Published As

Publication number Publication date
CN106951437B (en) 2019-11-01

Similar Documents

Publication Publication Date Title
CN110321432B (en) Text event information extraction method, electronic device and nonvolatile storage medium
Li et al. Text-level discourse dependency parsing
Nguyen et al. Relation extraction: Perspective from convolutional neural networks
CN106951437A (en) Identifying processing method and device suitable for the sensitive words and phrases of multiple Chinese
Fonseca et al. Mac-morpho revisited: Towards robust part-of-speech tagging
Bartoli et al. Automatic synthesis of regular expressions from examples
WO2017084267A1 (en) Method and device for keyphrase extraction
CN105095204B (en) The acquisition methods and device of synonym
US9558299B2 (en) Submatch extraction
Filice et al. Kelp: a kernel-based learning platform for natural language processing
US9460196B2 (en) Conditional string search
CN104252484B (en) A kind of phonetic error correction method and system
US20140214401A1 (en) Method and device for error correction model training and text error correction
CN111159363A (en) Knowledge base-based question answer determination method and device
CN111444330A (en) Method, device and equipment for extracting short text keywords and storage medium
CN111339268B (en) Entity word recognition method and device
CN105593845B (en) Generating means and its method based on the arrangement corpus for learning by oneself arrangement, destructive expression morpheme analysis device and its morpheme analysis method using arrangement corpus
WO2014117549A1 (en) Method and device for error correction model training and text error correction
WO2022222300A1 (en) Open relationship extraction method and apparatus, electronic device, and storage medium
CN103761225B (en) A kind of Chinese word semantic similarity calculation method of data-driven
Keraghel et al. Data augmentation process to improve deep learning-based ner task in the automotive industry field
JP6558852B2 (en) Clause identification apparatus, method, and program
US20160196303A1 (en) String search device, string search method, and string search program
KR101663038B1 (en) Entity boundary detection apparatus in text by usage-learning on the entity's surface string candidates and mtehod thereof
Celebi et al. Segmenting hashtags using automatically created training data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant