CN108536693A - Sensitive word filtering method, apparatus, electronic device, and storage medium - Google Patents

Sensitive word filtering method, apparatus, electronic device, and storage medium

Info

Publication number
CN108536693A
Authority
CN
China
Prior art keywords
text
user
character
sensitive
history
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710119329.3A
Other languages
Chinese (zh)
Inventor
陈朋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201710119329.3A
Publication of CN108536693A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a sensitive word filtering method, apparatus, electronic device, and storage medium. In the method, a sensitive dictionary is configured and loaded in advance and is reloaded whenever it changes. The specific steps include: obtaining text input by a user; traversing the user-input text based on a preset stride range; during the traversal, matching each character string traversed against all sensitive words in the loaded sensitive dictionary and, if a sensitive word in the sensitive dictionary is successfully matched, determining that the user-input text contains a sensitive word, terminating the traversal of the user-input text, and forbidding its publication; after the traversal ends, if none of the character strings traversed matched a sensitive word in the sensitive dictionary, determining that the user-input text contains no sensitive word and allowing it to be published.

Description

Sensitive word filtering method, apparatus, electronic device, and storage medium
Technical field
The present invention relates to the field of communication technology, and in particular to a sensitive word filtering method, apparatus, electronic device, and storage medium.
Background
On today's internet there are many forums where users leave messages and comments, and a forum can generate on the order of millions of posts and comments, or more, every day. Some of these messages or comments inevitably conflict with mainstream values or social morality. How to prevent such content from being published, and how to clean it up quickly once it has been published, is a problem every site administrator must face.
The current mainstream solution is to segment the text into words and compare the segmentation result against a sensitive dictionary, in order to judge whether the text contains a specified sensitive word. However, now that the internet has become an important platform for obtaining information, chatting, and entertainment, new internet slang springs up constantly, while the dictionary that the segmentation tool relies on cannot be updated in real time. Segmentation results therefore become inaccurate: some internet terms are split into words that do not convey their actual meaning, and some words are split incorrectly, which in turn reduces the accuracy of sensitive word filtering.
Summary of the invention
In view of this, the object of the present invention is to provide a sensitive word filtering method, apparatus, electronic device, and storage medium that can improve the accuracy of sensitive word filtering.
To achieve this object, the present invention provides the following technical solutions:
A sensitive word filtering method, in which a sensitive dictionary is configured and loaded in advance and is reloaded whenever it changes, the method comprising:
obtaining text input by a user;
traversing the user-input text based on a preset stride range;
during the traversal, for each character string traversed, matching the character string against all sensitive words in the loaded sensitive dictionary and, if a sensitive word in the sensitive dictionary is successfully matched, determining that the user-input text contains a sensitive word, terminating the traversal of the user-input text, and forbidding publication of the user-input text;
after the traversal ends, if none of the character strings traversed matched a sensitive word in the sensitive dictionary, determining that the user-input text contains no sensitive word and allowing the user-input text to be published.
A sensitive word filtering apparatus, comprising a configuration module, a text input module, and a real-time text filtering module;
the configuration module is configured to configure and load a sensitive dictionary, to reload the sensitive dictionary when it changes, and to preset the stride range used for text traversal;
the text input module is configured to obtain text input by a user;
the real-time text filtering module comprises a first traversal submodule and a first matching submodule, wherein
the first traversal submodule is configured to traverse the user-input text obtained by the text input module, based on the stride range preset by the configuration module;
the first matching submodule is configured to, while the first traversal submodule traverses the user-input text, match each character string traversed against all sensitive words in the loaded sensitive dictionary and, if a sensitive word in the sensitive dictionary is successfully matched, determine that the user-input text contains a sensitive word, terminate the first traversal submodule's traversal of the user-input text, and forbid publication of the user-input text; and, after the first traversal submodule finishes traversing the user-input text, if none of the character strings traversed matched a sensitive word in the sensitive dictionary, determine that the user-input text contains no sensitive word and allow the user-input text to be published.
An electronic device, comprising at least one processor and a memory connected to the at least one processor by a bus; the memory stores one or more computer programs executable by the at least one processor; the at least one processor implements the steps of the above method when executing the one or more computer programs.
A computer-readable storage medium storing one or more computer programs which, when executed by a processor, implement the above method.
As can be seen from the above technical solutions, the present invention traverses the user-input text based on a preset stride range and, during the traversal, compares each character string traversed with the sensitive words in the sensitive dictionary to judge whether the user-input text contains a sensitive word, and then decides from that result whether the text may be published. The invention abandons the prior-art practice of segmenting the text with a dictionary before matching sensitive words. Instead, it matches the words in the sensitive dictionary directly against all candidate character strings in the text. This accommodates real-time updates to the sensitive words and effectively improves the accuracy of sensitive word filtering.
Description of the drawings
Fig. 1 is a schematic architecture diagram of the sensitive word filtering system according to an embodiment of the present invention;
Fig. 2 is a flowchart of the sensitive word filtering method according to an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of the sensitive word filtering apparatus according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the electronic device according to an embodiment of the present invention.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, the technical solutions of the present invention are described in detail below with reference to the accompanying drawings and embodiments.
In the present invention, the text that requires sensitive word filtering is divided into two kinds. One is text submitted or input by a user in real time, such as information posted to a forum. The other is history text, which has already passed sensitive word filtering and been published. Filtering of the two kinds of text runs in parallel without interfering with each other, but both use the same sensitive dictionary. The sensitive words in the sensitive dictionary can be updated at any time, and text is always filtered with the latest sensitive dictionary, which ensures the accuracy of the filtering.
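The requirement that filtering always use the latest sensitive dictionary can be sketched as follows. This is a minimal illustration, not the patent's implementation: the patent does not say how a change is detected, so the file-based storage, the `SensitiveLexicon` class, and the comparison of modification time and size are all assumptions made for the example.

```python
from pathlib import Path


class SensitiveLexicon:
    """Hypothetical holder for the loaded sensitive words.

    Assumption: the dictionary is a UTF-8 text file with one word per line,
    and a change is detected by comparing (modification time, size).
    """

    def __init__(self, path):
        self.path = Path(path)
        self._signature = None      # signature of the version currently loaded
        self.words = frozenset()
        self.reload_if_changed()

    def reload_if_changed(self):
        """Reload the word list if the file differs from the loaded version."""
        stat = self.path.stat()
        signature = (stat.st_mtime_ns, stat.st_size)
        if signature == self._signature:
            return False            # nothing changed, keep the loaded words
        lines = self.path.read_text(encoding="utf-8").splitlines()
        self.words = frozenset(line.strip() for line in lines if line.strip())
        self._signature = signature
        return True
```

Calling `reload_if_changed()` before each filtering pass is one simple way to keep both the real-time filter and the history filter on the same, current word set.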
Referring to Fig. 1, a schematic architecture diagram of the sensitive word filtering system according to an embodiment of the present invention, the system comprises a text input module, a history text filtering module, and a real-time filtering module, wherein:
The text input module receives text input by a user and forwards it to the real-time filtering module;
The real-time filtering module judges, based on the latest sensitive dictionary, whether the user-input text contains a sensitive word, and outputs the judgment result. When no sensitive word is present, it also stores the user-input text as history text in the history text filtering module;
The history text filtering module receives task lists input by a user. A task list contains one or more tasks, and each task specifies multiple history texts to be filtered. The module periodically obtains these task lists and, for every history text to be filtered in every task, judges whether it contains a sensitive word based on the latest sensitive dictionary. A history text judged to contain a sensitive word can be flagged as pending, or corresponding marker information can be recorded, so that it can later be deleted or otherwise processed. A history text not judged to contain a sensitive word remains stored in the history text filtering module.
To make it easier for a user to specify the history texts to be filtered, history texts can be divided into text categories. The categories can follow the categories of the forums to which users post, for example a forum for entertainment news, a forum mainly for political opinions, a forum for economic news, a parenting forum, and so on. When a user inputs text, the category of the target forum is taken as the category of that text. When the text has passed the real-time filtering module and is stored as history text in the history text filtering module, its text category is recorded along with it. When entering a task list, the user can then specify that the history texts of a given category be filtered; in that case the multiple history texts to be filtered in each task are identified by a text category.
The sensitive word filtering process of the embodiment of the present invention is described in detail below:
Preprocessing before text filtering:
In the present invention, user-input text can first be preprocessed, mainly by applying at least one preset character converter to perform character conversion on it.
The at least one preset character converter may include one or more of a font converter, a letter converter, a digit converter, and a punctuation converter, wherein:
The font converter converts Chinese characters uniformly to traditional form or uniformly to simplified form. Because there is a one-to-one correspondence between traditional and simplified characters, the font converter can convert the Chinese characters in the user-input text from traditional to simplified, or from simplified to traditional, according to that correspondence. Whether the target is traditional or simplified can be set in advance.
The letter converter converts English letters uniformly to uppercase or uniformly to lowercase. Because there is a one-to-one correspondence between uppercase and lowercase letters, the letter converter can convert the English letters in the user-input text from uppercase to lowercase, or from lowercase to uppercase, according to that correspondence. Whether the target is uppercase or lowercase can be set in advance.
The digit converter converts characters that represent numbers uniformly to the corresponding standard Arabic numerals. Characters that represent numbers, such as Chinese numerals (for example the everyday and formal written forms of "one"), list-item numerals such as "1.", and Roman numerals such as "I", each correspond to a specific standard Arabic numeral. The digit converter can therefore convert the number characters in the user-input text to standard Arabic numerals according to that correspondence.
The punctuation converter converts characters that carry no linguistic meaning according to preset character conversion rules. When the punctuation converter is applied to user-input text, the punctuation marks in the text are converted according to those rules. The preset rules may include converting full-width characters to half-width, converting space characters to empty characters (in effect removing them), converting line breaks and tabs to empty characters, converting HTML tags to empty characters, and so on.
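The four converters described above can be sketched as a small pipeline. This is an illustration under stated assumptions: the function names are hypothetical, the two mapping tables are tiny samples rather than complete correspondences, the HTML rule is a naive tag-stripping regular expression, and the digit converter runs before the letter converter because lowercasing would otherwise alter Roman numeral characters.

```python
import re

# Sample tables only (assumption); real use would need complete mappings.
TRADITIONAL_TO_SIMPLIFIED = str.maketrans({"國": "国", "歷": "历"})
NUMERAL_TO_ARABIC = str.maketrans({"一": "1", "壹": "1", "①": "1", "Ⅰ": "1"})


def convert_font(text):
    """Font converter: traditional characters to simplified."""
    return text.translate(TRADITIONAL_TO_SIMPLIFIED)


def convert_digits(text):
    """Digit converter: number characters to Arabic numerals."""
    return text.translate(NUMERAL_TO_ARABIC)


def convert_letters(text):
    """Letter converter: unify English letters to lowercase."""
    return text.lower()


def to_half_width(text):
    """Map full-width forms (U+FF01..U+FF5E, U+3000) to their ASCII equivalents."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:
            out.append(chr(code - 0xFEE0))
        else:
            out.append(ch)
    return "".join(out)


def convert_punctuation(text):
    """Punctuation converter: half-width, then drop HTML tags and whitespace."""
    text = to_half_width(text)
    text = re.sub(r"<[^>]+>", "", text)   # naive HTML tag removal
    return re.sub(r"\s+", "", text)       # spaces, tabs, line breaks


CONVERTERS = [convert_font, convert_digits, convert_letters, convert_punctuation]


def preprocess(text, converters=CONVERTERS):
    """Apply each configured converter in order."""
    for convert in converters:
        text = convert(text)
    return text
```

For example, `preprocess("中國 ＡＢＣ <b>壹</b>\n")` yields `"中国abc1"`. Note that mapping a character such as 一 to a digit everywhere is crude, since the character also occurs in ordinary words; the sample table only illustrates the idea.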
Sensitive word filtering:
In the present invention, text is traversed based on a preset stride range, and sensitive word filtering is achieved by matching each character string traversed against the sensitive words in the sensitive dictionary.
First, the stride range used for traversal is set. For example, with a stride range of [2, 6], the stride takes the values 2, 3, 4, 5, and 6 during traversal. Note that the stride values in a stride range need not be contiguous.
Second, the text is traversed based on the preset stride range. Take as an example a Chinese sentence meaning "China is a country with a long history and culture"; the traversal proceeds as follows:
Starting from the 1st character, each stride value in the preset stride range is applied. With a stride range of [2, 6], five character strings are traversed: the first two characters of the sentence (the word "China"), the first three, the first four, the first five, and the first six. These five strings are matched against the words in the sensitive dictionary. If one of them is matched, for example "China", the text is considered to contain a sensitive word. If none of the five strings matches a sensitive word in the sensitive dictionary, then:
Starting from the 2nd character, each stride value in the preset stride range is applied again, yielding five character strings: the strings of two, three, four, five, and six characters that begin at the 2nd character. These five strings are matched against the words in the sensitive dictionary. If one of them is matched, the text is considered to contain a sensitive word. If none of the five strings matches a sensitive word in the sensitive dictionary, then:
Starting from the 3rd character, each stride value in the preset stride range is applied, and so on.
Note that in the above process the stride is measured in units of one Chinese character. In practice, a Chinese character occupies two code units (bytes) in GBK encoding, while an English letter occupies only one, so the length of one GBK code unit can also be used as the base unit of the stride.
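The difference between counting characters and counting GBK code units can be checked directly. The sketch below uses Python's built-in `gbk` codec; the helper `window_by_gbk_units` is a hypothetical illustration of one way a stride measured in GBK bytes might select whole characters, which the patent does not spell out.

```python
def gbk_units(text):
    """Number of GBK code units (bytes) that `text` occupies."""
    return len(text.encode("gbk"))


def window_by_gbk_units(text, start, units):
    """Longest run of whole characters from `start` that fits in `units` GBK bytes.

    Assumption: a window never splits a two-byte character in half.
    """
    used = 0
    end = start
    while end < len(text):
        width = len(text[end].encode("gbk"))
        if used + width > units:
            break
        used += width
        end += 1
    return text[start:end]
```

Under this reading, a stride of 4 code units covers two Chinese characters or four English letters, so mixed Chinese and English text is windowed by encoded width rather than by character count.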
In the above traversal, each character string traversed is matched against the sensitive words in the latest sensitive dictionary. If any sensitive word is matched, the text is determined to contain a sensitive word; to save computing resources, the rest of the text need not be traversed and the traversal can be terminated immediately.
After the traversal of the text ends, if none of the character strings traversed matched a sensitive word in the sensitive dictionary, it can be determined that the text contains no sensitive word.
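The traversal and matching just described can be sketched in a few lines. The function names are hypothetical, and two details are assumptions rather than statements from the patent: matching is treated as exact equality between a traversed string and a dictionary entry (a set lookup), and windows that would run past the end of the text are skipped.

```python
def find_sensitive_word(text, sensitive_words, strides=range(2, 7)):
    """Return the first traversed string that is a sensitive word, else None.

    `strides` defaults to the example range [2, 6]; it may be any iterable
    of lengths, contiguous or not.
    """
    for start in range(len(text)):
        for stride in strides:
            if start + stride > len(text):
                continue                      # window would overrun the text
            candidate = text[start:start + stride]
            if candidate in sensitive_words:  # exact match (assumption)
                return candidate              # stop at the first hit
    return None


def may_publish(text, sensitive_words, strides=range(2, 7)):
    """Publication is allowed only if no traversed string matches."""
    return find_sensitive_word(text, sensitive_words, strides) is None
```

Because the function returns on the first hit, the remainder of the text is never examined, which mirrors the early termination described above. A dictionary entry whose length lies outside the stride range can never be matched, so the range has to cover the lengths of the words in the dictionary.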
The sensitive word filtering principle of the present invention has been described in detail above. Based on this principle, the present invention provides a sensitive word filtering method and a sensitive word filtering apparatus, described in detail below with reference to Fig. 2 and Fig. 3:
Referring to Fig. 2, a flowchart of the sensitive word filtering method according to an embodiment of the present invention: in this method a sensitive dictionary is configured and loaded in advance and is reloaded whenever it changes. As shown in Fig. 2, the method mainly includes the following steps:
Step 201: obtain text input by a user (the text to be published; the user can enter the text through a text input interface, and after submission the backend can obtain the user-input text);
Step 202: traverse the user-input text based on a preset stride range; during the traversal, match each character string traversed against all sensitive words in the loaded sensitive dictionary and, if a sensitive word in the sensitive dictionary is successfully matched, determine that the user-input text contains a sensitive word, terminate the traversal of the user-input text, and forbid publication of the user-input text;
Step 203: after the traversal ends, if none of the character strings traversed matched a sensitive word in the sensitive dictionary, determine that the user-input text contains no sensitive word and allow the user-input text to be published.
In the method shown in Fig. 2,
before the user-input text is traversed based on the preset stride range, the method further comprises: performing character conversion on the user-input text using at least one preset character converter.
In the method shown in Fig. 2,
the at least one preset character converter includes one or more of a font converter, a letter converter, a digit converter, and a punctuation converter;
when the at least one character converter includes a font converter, performing character conversion on the user-input text using the font converter includes: converting the Chinese characters in the user-input text from traditional to simplified, or from simplified to traditional, according to the correspondence between traditional and simplified characters;
when the at least one character converter includes a letter converter, performing character conversion on the user-input text using the letter converter includes: converting the English letters in the user-input text from uppercase to lowercase, or from lowercase to uppercase, according to the correspondence between uppercase and lowercase letters;
when the at least one character converter includes a digit converter, performing character conversion on the user-input text using the digit converter includes: converting the Chinese numerals and list-item numerals in the user-input text to standard Arabic numerals according to the correspondence between those numerals and standard Arabic numerals;
when the at least one character converter includes a punctuation converter, performing character conversion on the user-input text using the punctuation converter includes: converting the punctuation marks in the user-input text according to preset character conversion rules; the preset character conversion rules include a rule converting full-width characters to half-width, a rule converting space characters to empty characters, a rule converting line breaks and tabs to empty characters, and a rule converting HTML tags to empty characters.
In the method shown in Fig. 2,
after publication of the user-input text is allowed, the method further comprises: importing the user-input text into a history text library as history text;
the method further comprises:
obtaining a task instruction input by a user and configuring a task list based on the task instruction. Here, the user can enter the task instruction through a task input interface, and after submission the backend can obtain it. The task instruction may contain text categories to be filtered, in which case those categories are added to the task list; each text category corresponds to one task, and all history texts of that category need to be filtered. Alternatively, the task instruction may contain multiple history text identifiers, in which case those identifiers are added to the task list as one task, and the history texts they identify are the texts to be filtered;
periodically obtaining the configured task list, the task list including at least one task, each task including multiple history texts to be filtered as specified by the user;
for each history text to be filtered in each task of the task list, traversing that history text based on the preset stride range;
during the traversal, for each character string traversed, matching the character string against all sensitive words in the loaded sensitive dictionary and, if a sensitive word in the sensitive dictionary is successfully matched, determining that the history text to be filtered contains a sensitive word, terminating the traversal of that history text, and marking that history text as pending;
after the traversal ends, if none of the character strings traversed matched a sensitive word in the sensitive dictionary, determining that the history text to be filtered contains no sensitive word and continuing to retain it in the history text library.
In the method shown in Fig. 2,
after the traversal based on the preset stride range has been performed on all history texts to be filtered in every task of the task list, the method further comprises: batch-deleting the history texts in the history text library that are marked as pending.
In the method shown in Fig. 2,
when the user-input text is obtained, the text category of the user-input text is further determined;
when the user-input text is imported into the history text library as history text, the text category of that history text is further recorded in the history text library;
the task list is specified by the user; the multiple history texts to be filtered in each task of the task list are identified by a user-specified text category and are all history texts of that category.
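The history text flow (tasks identified by category, marking matches as pending, then batch deletion) can be sketched as below. This is a simplified illustration: the `HistoryText` record, the in-memory list standing in for the history text library, and the function names are all assumptions, and the periodic scheduling of task lists is left out.

```python
from dataclasses import dataclass


@dataclass
class HistoryText:
    """Hypothetical record in the history text library."""
    text_id: int
    category: str
    body: str
    pending: bool = False


def contains_sensitive_word(text, sensitive_words, strides=range(2, 7)):
    """Stride-range traversal with exact matching; stops at the first hit."""
    return any(
        text[start:start + stride] in sensitive_words
        for start in range(len(text))
        for stride in strides
        if start + stride <= len(text)
    )


def run_task_list(task_list, library, sensitive_words):
    """Each task names a category; flag matching texts of that category as pending."""
    for category in task_list:
        for item in library:
            if item.category == category and contains_sensitive_word(
                item.body, sensitive_words
            ):
                item.pending = True


def delete_pending(library):
    """Batch deletion: keep only the texts that were not flagged."""
    return [item for item in library if not item.pending]
```

Texts in categories that no task names are left untouched, even if they would match, which is consistent with filtering only what the task list specifies.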
In the method shown in Fig. 2,
the preset stride range includes multiple strides;
the text is traversed based on the preset stride range as follows:
traversal begins from the 1st character of the text: each continuous character string that starts at the 1st character and has the same length as one of the strides in the stride range is taken as a traversed character string and matched against all sensitive words in the loaded sensitive dictionary, and the traversal ends if a match succeeds; when the continuous character strings starting at the 1st character for all strides in the stride range are found to match no sensitive word in the sensitive dictionary,
traversal begins from the 2nd character of the text: each continuous character string that starts at the 2nd character and has the same length as one of the strides in the stride range is taken as a traversed character string and matched against all sensitive words in the loaded sensitive dictionary, and the traversal ends if a match succeeds; when the continuous character strings starting at the 2nd character for all strides in the stride range are found to match no sensitive word in the sensitive dictionary,
traversal begins from the 3rd character of the text, and so on.
Referring to Fig. 3, a schematic structural diagram of the sensitive word filtering apparatus according to an embodiment of the present invention, the apparatus comprises a configuration module 301, a text input module 302, and a real-time text filtering module 303, wherein:
the configuration module 301 is configured to configure and load a sensitive dictionary, to reload the sensitive dictionary when it changes, and to preset the stride range used for text traversal;
the text input module 302 is configured to obtain text input by a user;
the real-time text filtering module 303 comprises a first traversal submodule 3031 and a first matching submodule 3032, wherein
the first traversal submodule 3031 is configured to traverse the user-input text obtained by the text input module, based on the stride range preset by the configuration module 301;
the first matching submodule 3032 is configured to, while the first traversal submodule 3031 traverses the user-input text, match each character string traversed against all sensitive words in the loaded sensitive dictionary and, if a sensitive word in the sensitive dictionary is successfully matched, determine that the user-input text contains a sensitive word, terminate the first traversal submodule 3031's traversal of the user-input text, and forbid publication of the user-input text; and, after the first traversal submodule 3031 finishes traversing the user-input text, if none of the character strings traversed matched a sensitive word in the sensitive dictionary, determine that the user-input text contains no sensitive word and allow the user-input text to be published.
In the apparatus shown in Fig. 3, the real-time text filtering module 303 further comprises a conversion submodule 3033;
the configuration module 301 is further configured to preconfigure at least one character converter;
the conversion submodule 3033 is configured to perform character conversion on the user-input text using the at least one preset character converter before the first traversal submodule 3031 traverses the user-input text based on the preset stride range.
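One way to picture how the configuration module and the three submodules of the real-time text filtering module fit together is the class sketch below. The class and method names are hypothetical and the mapping of methods to reference numerals is only a reading of Fig. 3 as described in the text; matching is again assumed to be exact equality.

```python
class ConfigModule:
    """Stands in for configuration module 301 (names are assumptions)."""

    def __init__(self, sensitive_words, strides=(2, 3, 4, 5, 6), converters=()):
        self.sensitive_words = set(sensitive_words)
        self.strides = tuple(strides)
        self.converters = tuple(converters)

    def reload(self, sensitive_words):
        """Replace the loaded dictionary when it changes."""
        self.sensitive_words = set(sensitive_words)


class RealTimeTextFilter:
    """Stands in for real-time text filtering module 303."""

    def __init__(self, config):
        self.config = config

    def convert(self, text):
        """Conversion submodule 3033: apply the configured converters in order."""
        for converter in self.config.converters:
            text = converter(text)
        return text

    def traverse(self, text):
        """First traversal submodule 3031: lazily yield each traversed string."""
        for start in range(len(text)):
            for stride in self.config.strides:
                if start + stride <= len(text):
                    yield text[start:start + stride]

    def allow_publication(self, text):
        """First matching submodule 3032: reject on the first match."""
        for candidate in self.traverse(self.convert(text)):
            if candidate in self.config.sensitive_words:
                return False   # leaving the loop also ends the traversal
        return True
```

Because `traverse` is a generator, returning from `allow_publication` on the first match stops the traversal without producing the remaining strings, and because the filter reads the word set from the configuration object on every call, a `reload` takes effect on the next text.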
In the apparatus shown in Fig. 3,
the at least one character converter preset by the configuration module 301 includes one or more of a font converter, a letter converter, a digit converter, and a punctuation converter;
when the at least one character converter includes a font converter, the conversion submodule 3033 performs character conversion on the user-input text using the font converter by: converting the Chinese characters in the user-input text from traditional to simplified, or from simplified to traditional, according to the correspondence between traditional and simplified characters;
when the at least one character converter includes a letter converter, the conversion submodule 3033 performs character conversion on the user-input text using the letter converter by: converting the English letters in the user-input text from uppercase to lowercase, or from lowercase to uppercase, according to the correspondence between uppercase and lowercase letters;
when the at least one character converter includes a digit converter, the conversion submodule 3033 performs character conversion on the user-input text using the digit converter by: converting the Chinese numerals and list-item numerals in the user-input text to standard Arabic numerals according to the correspondence between those numerals and standard Arabic numerals;
when the at least one character converter includes a punctuation converter, the conversion submodule 3033 performs character conversion on the user-input text using the punctuation converter by: converting the punctuation marks in the user-input text according to preset character conversion rules; the preset character conversion rules include a rule converting full-width characters to half-width, a rule converting space characters to empty characters, a rule converting line breaks and tabs to empty characters, and a rule converting HTML tags to empty characters.
The apparatus shown in Fig. 3 further comprises a history text filtering module 304;
the first matching submodule 3032, after allowing publication of the user-input text, is further configured to import the user-input text into the history text library as history text;
the configuration module 301 is configured to obtain a task instruction from a user and to configure a task list based on the task instruction;
the history text filtering module 304 comprises a task submodule 3041, a second traversal submodule 3042, and a second matching submodule 3043, wherein
the task submodule 3041 is configured to periodically obtain the task list configured by the configuration module 301, the task list including at least one task, each task including multiple history texts to be filtered that the user has specified in the history text library;
the second traversal submodule 3042 is configured to, for each history text to be filtered in each task of the task list, traverse that history text based on the preset stride range;
the second matching submodule 3043 is configured to, while the second traversal submodule 3042 traverses each history text to be filtered, match each character string traversed against all sensitive words in the loaded sensitive dictionary and, if a sensitive word in the sensitive dictionary is successfully matched, determine that the history text to be filtered contains a sensitive word, terminate the second traversal submodule 3042's traversal of that history text, and mark that history text as pending; and, after the traversal ends, if none of the character strings traversed matched a sensitive word in the sensitive dictionary, determine that the history text to be filtered contains no sensitive word and continue to retain it in the history text library.
In Fig. 3 shown devices, the history text filtering module 304 further includes result treatment submodule 3044;
The result treatment submodule 3044, traverse submodule 3042 for second includes to each task in task list After all history texts to be filtered execute the traversal based on preset stride range, it is for label in history text library The history text of processing executes batch and deletes.
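The cooperation of the task, traversal, matching, and result processing submodules can be sketched roughly as follows. This is an illustrative sketch only: `HistoryText`, `contains_sensitive`, and `run_tasks` are hypothetical names, the history text library is modelled as an in-memory dictionary, and the periodic scheduling of the task submodule is omitted.

```python
from dataclasses import dataclass


@dataclass
class HistoryText:
    text_id: int
    content: str
    pending: bool = False   # True once marked "to be processed"


def contains_sensitive(text, sensitive_words, strides):
    """Return True as soon as any stride-length substring is a sensitive word."""
    for start in range(len(text)):
        for stride in strides:
            candidate = text[start:start + stride]
            if len(candidate) == stride and candidate in sensitive_words:
                return True   # traversal of this text stops at the first hit
    return False


def run_tasks(task_list, library, sensitive_words, strides):
    """Scan every history text named in every task, then batch delete.

    task_list: list of tasks, each a list of text ids (assumed layout).
    library:   dict mapping text id -> HistoryText (assumed layout).
    Returns the ids that were deleted.
    """
    for task in task_list:
        for text_id in task:
            record = library.get(text_id)
            if record is None:
                continue
            if contains_sensitive(record.content, sensitive_words, strides):
                record.pending = True          # mark as to be processed
    # Result processing: delete all marked texts in one batch.
    flagged = [tid for tid, rec in library.items() if rec.pending]
    for tid in flagged:
        del library[tid]
    return flagged
```

Texts in which no sensitive word is found are left untouched in the library, which corresponds to the "continue to retain" branch described above.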
In the device shown in Fig. 3,
The text input module 302, when obtaining the text input by the user, further determines the text category of the text input by the user;
The first matching submodule 3031, when importing the text input by the user into the history text library as a history text, further marks the text category of that history text in the history text library;
The task list is specified by the user; the multiple history texts to be filtered included in each task in the task list are identified by a text category specified by the user, and are all the history texts to be filtered that correspond to the text category specified by the user.
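As a rough illustration of how a task may be resolved from a user-specified text category, the sketch below keeps history texts grouped by category. The class name `HistoryLibrary`, its methods, and the category labels used in the usage note are assumptions, not part of the embodiment.

```python
from collections import defaultdict


class HistoryLibrary:
    """Toy in-memory history text library indexed by text category."""

    def __init__(self):
        self._by_category = defaultdict(list)

    def add(self, text: str, category: str) -> None:
        # The category is recorded at import time, alongside the text.
        self._by_category[category].append(text)

    def build_task(self, category: str) -> list:
        # A task is the set of all history texts carrying the given category.
        return list(self._by_category.get(category, []))
```

With this layout, a user who specifies the hypothetical category `"review"` obtains a task containing every stored text marked `"review"`, and an unknown category yields an empty task.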
In the device shown in Fig. 3,
The stride range preset by the configuration module 301 includes multiple strides;
The first traversal submodule 3031 and the second traversal submodule 3042, when traversing a text based on the preset stride range, are configured to:
Start traversing from the first character of the text, including: taking each continuous character string that starts at the first character of the text and is equal in length to a stride in the stride range as one traversed character string, and matching that character string against all sensitive words in the sensitive word dictionary, the traversal ending if a match succeeds; when it is determined that none of the continuous character strings that start at the first character of the text and are equal in length to the strides in the stride range matches any sensitive word in the sensitive word dictionary,
Start traversing from the second character of the text, including: taking each continuous character string that starts at the second character of the text and is equal in length to a stride in the stride range as one traversed character string, and matching that character string against all sensitive words in the sensitive word dictionary, the traversal ending if a match succeeds; when it is determined that none of the continuous character strings that start at the second character of the text and are equal in length to the strides in the stride range matches any sensitive word in the sensitive word dictionary,
Start traversing from the third character of the text, and so on.
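The stride-based traversal described above can be sketched as a pair of nested loops: the outer loop advances the starting character, and the inner loop tries one substring per stride. The function name `find_first_sensitive`, the use of a Python `set` as the sensitive word dictionary, and the choice to skip strides that run past the end of the text are assumptions made for illustration.

```python
def find_first_sensitive(text, sensitive_words, strides):
    """Return (start_index, word) for the first match, or None if no match.

    text:            the text to traverse
    sensitive_words: set of sensitive words (the loaded dictionary)
    strides:         iterable of substring lengths, i.e. the stride range
    """
    for start in range(len(text)):          # 1st character, 2nd character, ...
        for stride in strides:              # every stride in the stride range
            end = start + stride
            if end > len(text):
                continue                    # substring would run off the text
            candidate = text[start:end]
            if candidate in sensitive_words:
                return start, candidate     # match found: traversal ends
    return None                             # traversal finished without a match
```

For instance, with strides `[2, 3]` and the dictionary `{"cd", "def"}`, traversing `"abcdef"` tries `"ab"`, `"abc"`, `"bc"`, `"bcd"` and then stops at `"cd"`, returning `(2, "cd")`. Choosing the stride range to cover the lengths of the words in the dictionary keeps the number of substrings tried per starting position small.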
Referring to Fig. 4, another embodiment of the present invention further provides an electronic device whose function is the same as that of the device shown in Fig. 3. The electronic device shown in Fig. 4 includes: at least one processor 401, and a memory 402 connected to the at least one processor through a bus; the memory 402 stores one or more computer programs executable by the at least one processor; the at least one processor 401 implements the steps of the method shown in Fig. 2 when executing the one or more computer programs.
The present invention further provides a computer-readable storage medium, the computer-readable storage medium storing one or more computer programs, the one or more computer programs implementing the method shown in Fig. 2 when executed by a processor.
The foregoing describes merely preferred embodiments of the present invention and is not intended to limit the present invention; any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (16)

1. A sensitive word filtering method, characterized in that a sensitive word dictionary is configured and loaded in advance, and is reloaded when the sensitive word dictionary changes, the method including:
Obtaining text input by a user;
Traversing the text input by the user based on a preset stride range;
During the traversal, matching each traversed text string against all sensitive words in the loaded sensitive word dictionary; if a sensitive word in the sensitive word dictionary is successfully matched, determining that the text input by the user contains a sensitive word, terminating the traversal of the text input by the user, and prohibiting publication of the text input by the user;
After the traversal ends, if it is determined that none of the traversed text strings matched a sensitive word in the sensitive word dictionary, determining that the text input by the user contains no sensitive word, and allowing publication of the text input by the user.
2. The method according to claim 1, characterized in that,
Before the text input by the user is traversed based on the preset stride range, the method further includes: performing character conversion on the text input by the user using at least one preset character converter.
3. The method according to claim 2, characterized in that,
The at least one preset character converter includes one or more of a font converter, a letter converter, a digit converter, and a punctuation converter;
When the at least one character converter includes a font converter, performing character conversion on the text input by the user using the font converter includes: converting the Chinese characters in the text input by the user from traditional to simplified, or from simplified to traditional, according to the correspondence between traditional and simplified Chinese characters;
When the at least one character converter includes a letter converter, performing character conversion on the text input by the user using the letter converter includes: converting the English letters in the text input by the user from uppercase to lowercase, or from lowercase to uppercase, according to the correspondence between uppercase and lowercase letters;
When the at least one character converter includes a digit converter, performing character conversion on the text input by the user using the digit converter includes: converting the textual numerals and item-numbering numerals in the text input by the user into standard Arabic numerals, according to the correspondence between textual numerals and item-numbering numerals and standard Arabic numerals;
When the at least one character converter includes a punctuation converter, performing character conversion on the text input by the user using the punctuation converter includes: converting the punctuation marks in the text input by the user according to preset character conversion rules; the preset character conversion rules include a rule for converting full-width (SBC) characters to half-width (DBC) characters, a rule for converting space characters to the empty character, a rule for converting newline and tab characters to the empty character, and a rule for converting HTML markup to the empty character.
4. The method according to claim 1, characterized in that,
After publication of the text input by the user is allowed, the method further includes: importing the text input by the user into a history text library as a history text;
The method further includes:
Obtaining a task instruction from the user, and configuring a task list based on the task instruction;
Periodically obtaining the configured task list, the task list including at least one task, and each task including multiple history texts to be filtered that the user has specified in the history text library;
For each history text to be filtered included in each task in the task list, traversing that history text based on the preset stride range;
During the traversal, matching each traversed text string against all sensitive words in the loaded sensitive word dictionary; if a sensitive word in the sensitive word dictionary is successfully matched, determining that the history text to be filtered contains a sensitive word, terminating the traversal of that history text, and marking that history text as to be processed;
After the traversal ends, if it is determined that none of the traversed text strings matched a sensitive word in the sensitive word dictionary, determining that the history text to be filtered contains no sensitive word, and continuing to retain that history text in the history text library.
5. The method according to claim 4, characterized in that,
After the traversal based on the preset stride range has been performed on all history texts to be filtered included in each task in the task list, the method further includes: batch deleting the history texts in the history text library that are marked as to be processed.
6. The method according to claim 5, characterized in that,
When the text input by the user is obtained, the text category of the text input by the user is further determined;
When the text input by the user is imported into the history text library as a history text, the text category of that history text is further marked in the history text library;
The task list is specified by the user; the multiple history texts to be filtered included in each task in the task list are identified by a text category specified by the user, and are all the history texts to be filtered that correspond to the text category specified by the user.
7. The method according to claim 1 or 4, characterized in that,
The preset stride range includes multiple strides;
The method of traversing a text based on the preset stride range is:
Starting the traversal from the first character of the text, including: taking each continuous character string that starts at the first character of the text and is equal in length to a stride in the stride range as one traversed character string, and matching that character string against all sensitive words in the loaded sensitive word dictionary, the traversal ending if a match succeeds; when it is determined that none of the continuous character strings that start at the first character of the text and are equal in length to the strides in the stride range matches any sensitive word in the sensitive word dictionary,
Starting the traversal from the second character of the text, including: taking each continuous character string that starts at the second character of the text and is equal in length to a stride in the stride range as one traversed character string, and matching that character string against all sensitive words in the loaded sensitive word dictionary, the traversal ending if a match succeeds; when it is determined that none of the continuous character strings that start at the second character of the text and are equal in length to the strides in the stride range matches any sensitive word in the sensitive word dictionary,
Starting the traversal from the third character of the text, and so on.
8. A sensitive word filtering device, characterized in that the device includes: a configuration module, a text input module, and a real-time text filtering module;
The configuration module is configured to configure and load a sensitive word dictionary, and to reload the sensitive word dictionary when it changes; and to preset a stride range for text traversal;
The text input module is configured to obtain text input by a user;
The real-time text filtering module includes a first traversal submodule and a first matching submodule; wherein,
The first traversal submodule is configured to traverse the text input by the user obtained by the text input module, based on the stride range preset by the configuration module;
The first matching submodule is configured to, while the first traversal submodule traverses the text input by the user, match each traversed text string against all sensitive words in the loaded sensitive word dictionary; if a sensitive word in the sensitive word dictionary is successfully matched, determine that the text input by the user contains a sensitive word, terminate the traversal of the text input by the user by the first traversal submodule, and prohibit publication of the text input by the user; and, after the first traversal submodule finishes traversing the text input by the user, if it is determined that none of the traversed text strings matched a sensitive word in the sensitive word dictionary, determine that the text input by the user contains no sensitive word, and allow publication of the text input by the user.
9. The device according to claim 8, characterized in that the real-time text filtering module further includes a conversion submodule;
The configuration module is further configured to preconfigure at least one character converter;
The conversion submodule is configured to, before the first traversal submodule traverses the text input by the user based on the preset stride range, perform character conversion on the text input by the user using the at least one preset character converter.
10. The device according to claim 9, characterized in that,
The at least one character converter preset by the configuration module includes one or more of a font converter, a letter converter, a digit converter, and a punctuation converter;
When the at least one character converter includes a font converter, the conversion submodule performing character conversion on the text input by the user using the font converter includes: converting the Chinese characters in the text input by the user from traditional to simplified, or from simplified to traditional, according to the correspondence between traditional and simplified Chinese characters;
When the at least one character converter includes a letter converter, the conversion submodule performing character conversion on the text input by the user using the letter converter includes: converting the English letters in the text input by the user from uppercase to lowercase, or from lowercase to uppercase, according to the correspondence between uppercase and lowercase letters;
When the at least one character converter includes a digit converter, the conversion submodule performing character conversion on the text input by the user using the digit converter includes: converting the textual numerals and item-numbering numerals in the text input by the user into standard Arabic numerals, according to the correspondence between textual numerals and item-numbering numerals and standard Arabic numerals;
When the at least one character converter includes a punctuation converter, the conversion submodule performing character conversion on the text input by the user using the punctuation converter includes: converting the punctuation marks in the text input by the user according to preset character conversion rules; the preset character conversion rules include a rule for converting full-width (SBC) characters to half-width (DBC) characters, a rule for converting space characters to the empty character, a rule for converting newline and tab characters to the empty character, and a rule for converting HTML markup to the empty character.
11. The device according to claim 8, characterized in that the device further includes a history text filtering module;
The first matching submodule, after allowing the text input by the user to be published, is further configured to: import the text input by the user into a history text library as a history text;
The configuration module is further configured to obtain a task instruction from the user and to configure a task list based on the task instruction;
The history text filtering module includes a task submodule, a second traversal submodule, and a second matching submodule; wherein,
The task submodule is configured to periodically obtain the task list configured by the configuration module, the task list including at least one task, and each task including multiple history texts to be filtered that the user has specified in the history text library;
The second traversal submodule is configured to traverse, based on the preset stride range, each history text to be filtered included in each task in the task list;
The second matching submodule is configured to, while the second traversal submodule traverses each history text to be filtered, match each traversed text string against all sensitive words in the loaded sensitive word dictionary; if a sensitive word in the sensitive word dictionary is successfully matched, determine that the history text to be filtered contains a sensitive word, terminate the traversal of that history text by the second traversal submodule, and mark that history text as to be processed; after the traversal ends, if it is determined that none of the traversed text strings matched a sensitive word in the sensitive word dictionary, determine that the history text to be filtered contains no sensitive word, and continue to retain that history text in the history text library.
12. The device according to claim 11, characterized in that the history text filtering module further includes a result processing submodule;
The result processing submodule is configured to, after the second traversal submodule has performed the traversal based on the preset stride range on all history texts to be filtered included in each task in the task list, batch delete the history texts in the history text library that are marked as to be processed.
13. The device according to claim 12, characterized in that,
The text input module, when obtaining the text input by the user, further determines the text category of the text input by the user;
The first matching submodule, when importing the text input by the user into the history text library as a history text, further marks the text category of that history text in the history text library;
The task list is specified by the user; the multiple history texts to be filtered included in each task in the task list are identified by a text category specified by the user, and are all the history texts to be filtered that correspond to the text category specified by the user.
14. The device according to claim 8 or 11, characterized in that,
The stride range preset by the configuration module includes multiple strides;
The first traversal submodule and the second traversal submodule, when traversing a text based on the preset stride range, are configured to:
Start traversing from the first character of the text, including: taking each continuous character string that starts at the first character of the text and is equal in length to a stride in the stride range as one traversed character string, and matching that character string against all sensitive words in the sensitive word dictionary, the traversal ending if a match succeeds; when it is determined that none of the continuous character strings that start at the first character of the text and are equal in length to the strides in the stride range matches any sensitive word in the sensitive word dictionary,
Start traversing from the second character of the text, including: taking each continuous character string that starts at the second character of the text and is equal in length to a stride in the stride range as one traversed character string, and matching that character string against all sensitive words in the sensitive word dictionary, the traversal ending if a match succeeds; when it is determined that none of the continuous character strings that start at the second character of the text and are equal in length to the strides in the stride range matches any sensitive word in the sensitive word dictionary,
Start traversing from the third character of the text, and so on.
15. An electronic device, including: at least one processor, and a memory connected to the at least one processor through a bus; the memory stores one or more computer programs executable by the at least one processor; characterized in that the at least one processor implements the method steps of any one of claims 1-7 when executing the one or more computer programs.
16. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more computer programs, and the one or more computer programs implement the method of any one of claims 1-7 when executed by a processor.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710119329.3A CN108536693A (en) 2017-03-02 2017-03-02 A kind of filtering sensitive words method, apparatus, electronic equipment, storage medium

Publications (1)

Publication Number Publication Date
CN108536693A true CN108536693A (en) 2018-09-14

Family

ID=63488798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710119329.3A Pending CN108536693A (en) 2017-03-02 2017-03-02 A kind of filtering sensitive words method, apparatus, electronic equipment, storage medium

Country Status (1)

Country Link
CN (1) CN108536693A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090265339A1 (en) * 2006-04-12 2009-10-22 Lonsou (Beijing) Technologies Co., Ltd. Method and system for facilitating rule-based document content mining
CN101751461A (en) * 2009-12-30 2010-06-23 中兴通讯股份有限公司 Document conversion method and device
CN103678651A (en) * 2013-12-20 2014-03-26 Tcl集团股份有限公司 Sensitive word searching method and device
CN105574090A (en) * 2015-12-10 2016-05-11 北京中科汇联科技股份有限公司 Sensitive word filtering method and system

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492118A (en) * 2018-10-31 2019-03-19 北京奇艺世纪科技有限公司 A kind of data detection method and detection device
CN109492118B (en) * 2018-10-31 2021-04-16 北京奇艺世纪科技有限公司 Data detection method and detection device
CN113157904A (en) * 2021-03-30 2021-07-23 北京优医达智慧健康科技有限公司 Sensitive word filtering method and system based on DFA algorithm
CN113157904B (en) * 2021-03-30 2024-02-09 北京优医达智慧健康科技有限公司 Sensitive word filtering method and system based on DFA algorithm
CN113221554A (en) * 2021-04-27 2021-08-06 北京字跳网络技术有限公司 Text processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180914