CN111368535B - Sensitive word recognition method, device and equipment - Google Patents

Sensitive word recognition method, device and equipment

Info

Publication number
CN111368535B
CN111368535B (application CN201811603465.0A)
Authority
CN
China
Prior art keywords
word
information
segmentation
group
information entropy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811603465.0A
Other languages
Chinese (zh)
Other versions
CN111368535A (en)
Inventor
余建兴
余敏雄
余赢超
王焜
冯毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Kingsoft Digital Network Technology Co Ltd
Original Assignee
Zhuhai Kingsoft Digital Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Kingsoft Digital Network Technology Co Ltd filed Critical Zhuhai Kingsoft Digital Network Technology Co Ltd
Priority to CN201811603465.0A
Publication of CN111368535A
Application granted
Publication of CN111368535B
Active legal status (current)
Anticipated expiration

Landscapes

  • Character Discrimination (AREA)

Abstract

Embodiments of the invention provide a sensitive word recognition method, device and equipment. The method comprises: determining the context information corresponding to each word in a word segment; generating a word vector sequence of the word segment according to the semantic relations in the context information, the word vector sequence comprising the word vectors of the word segment and the word vectors of the words in its context; and inputting the word vector sequence of the word segment into a pre-trained recognition model to obtain a recognition result of whether the word segment is a sensitive word. For a variant of a sensitive word, even if the character form or pronunciation changes, the contextual semantic dependency relationship is unchanged; the variant can therefore be identified based on the contextual semantic dependency relationship, improving the recognition effect.

Description

Sensitive word recognition method, device and equipment
Technical Field
The present invention relates to the field of word processing technologies, and in particular, to a method, an apparatus, and a device for recognizing a sensitive word.
Background
In some internet scenarios, such as web forums, personal homepages, game chat, etc., users may post some text to express comments, express moods, or communicate with other users. In order to build a healthy network environment, it is generally required to audit the text content published by the user, that is, to identify whether the text content contains some sensitive words which do not meet the specification.
Existing sensitive word recognition schemes typically include: obtaining the text content published by a user, segmenting the text content to obtain a plurality of word segments, and matching each word segment against a pre-established sensitive word library; if the matching succeeds, the word segment is a sensitive word.
Currently there are many variants of sensitive words that are similar in character form or pronunciation to the original sensitive words. The above scheme cannot identify such variants, so its recognition effect is poor.
Disclosure of Invention
The embodiment of the invention aims to provide a sensitive word recognition method, a device and equipment so as to improve recognition effect.
To achieve the above object, an embodiment of the present invention provides a method for identifying a sensitive word, including:
acquiring a text to be identified;
performing segmentation processing on the text to be identified to obtain a plurality of segmentation words;
for each word segment, determining the context information corresponding to each word in the word segment; generating a word vector sequence of the word according to the semantic relation in the context information, wherein the word vector sequence comprises the word vector of the word and the word vector of the word in the context information;
and inputting the word vector sequence of the segmented word into a recognition model which is obtained through training in advance, and obtaining a recognition result of whether the segmented word is a sensitive word.
Optionally, after the text to be recognized is obtained, the method further includes:
performing any one or more of the following preprocessing operations on the text to be identified: character cleaning, full-width to half-width conversion, traditional-to-simplified Chinese conversion, pinyin-to-text conversion, merging of split characters, and homophone restoration, to obtain a preprocessed text;
the text to be identified is segmented to obtain a plurality of segmentation words, which comprises the following steps:
and carrying out segmentation processing on the preprocessed text to obtain a plurality of segmentation words.
Optionally, after the text to be recognized is obtained, the method further includes:
iteratively intercepting character strings with preset lengths from the text to be identified;
matching each intercepted character string with a pre-established dictionary tree;
if there is a branch in the dictionary tree that matches the string, the string is identified as a sensitive word.
Optionally, the performing segmentation processing on the text to be identified to obtain a plurality of segmentation words includes:
calculating mutual information between every two adjacent words in the text to be recognized, wherein the mutual information represents the association degree between the adjacent words;
for each piece of mutual information greater than a preset association threshold, forming a candidate binary group by two adjacent words corresponding to the mutual information;
And calculating the information entropy of each candidate binary group, and carrying out segmentation processing on the text to be identified according to the calculated information entropy to obtain a plurality of segmentation words.
Optionally, the segmenting the text to be identified according to the information entropy obtained by calculation to obtain a plurality of segmentation words includes:
judging whether the information entropy of each candidate binary group is larger than a preset probability threshold value or not according to each candidate binary group;
if so, determining the candidate binary group as a word segmentation;
if not, expanding the candidate binary group to obtain a multi-group, and judging whether the information entropy of the multi-group is larger than the preset probability threshold; if so, the multi-element group is determined to be a word.
Optionally, the information entropy includes left information entropy and right information entropy; the judging whether the information entropy of the candidate binary group is larger than a preset probability threshold value comprises the following steps:
judging whether the left information entropy and the right information entropy of the candidate binary group are both larger than a preset probability threshold value;
expanding the candidate binary group to obtain a multi-group, and judging whether the information entropy of the multi-group is larger than the preset probability threshold or not, wherein the method comprises the following steps:
under the condition that the left information entropy is not greater than the preset probability threshold, expanding the candidate binary group leftwards to obtain a left expanded multi-group, and judging whether the information entropy of the left expanded multi-group is greater than the preset probability threshold;
Under the condition that the right information entropy is not greater than the preset probability threshold, expanding the candidate binary group rightward to obtain a right expanded multi-group, and judging whether the information entropy of the right expanded multi-group is greater than the preset probability threshold;
said determining said plurality of groups as a word segment comprising:
determining the left expanded multi-tuple as a word segmentation under the condition that the information entropy of the left expanded multi-tuple is larger than the preset probability threshold;
and under the condition that the information entropy of the right expanded multi-element group is larger than the preset probability threshold value, determining the right expanded multi-element group as a word segmentation.
Optionally, after the obtaining the left expanded multi-tuple, the method further includes:
judging whether the length of the left expanded multi-element group reaches a preset length threshold value or not;
if not, executing the step of judging whether the information entropy of the left expanded multi-element group is larger than the preset probability threshold value;
after the right expanded multi-tuple is obtained, the method further comprises:
judging whether the length of the right expanded multi-element group reaches a preset length threshold value or not;
and if not, executing the step of judging whether the information entropy of the right expanded multi-element group is larger than the preset probability threshold value.
Optionally, the determining the context information corresponding to each word in the word segment includes:
for each word in the word segmentation, acquiring stroke information of the word;
performing characteristic numerical processing on the stroke information to obtain a multi-element characteristic sequence of the character;
and inputting the multi-element feature sequence of the word into a mapping model obtained through preset training to obtain the context information corresponding to the word.
Optionally, the generating the word vector sequence of the word according to the semantic relation in the context information includes:
determining a contextual semantic dependency relationship of the word segment, the dependency relationship comprising any one or more of: the preceding context, the following context, and the segment to which the word segment belongs;
and generating a word vector sequence of the segmented word according to the context semantic dependency relationship of the segmented word by using a long sequence coding algorithm.
Optionally, training to obtain the identification model by adopting the following steps:
obtaining a training sample, the training sample comprising: a word vector sequence consisting of word vectors of a plurality of continuous word segments and classification results corresponding to the plurality of continuous word segments;
inputting the training sample into a classification model of a preset structure;
recording order information of the word vector sequence by utilizing a sequence memory unit in the classification model;
Generating a prediction signal based on the sequence by the classification model through a mean value collection strategy;
and iteratively adjusting parameters of the sequence memory unit based on the prediction signals and the classification results to obtain a recognition model after training.
In order to achieve the above object, an embodiment of the present invention further provides a sensitive word recognition device, including:
the acquisition module is used for acquiring the text to be identified;
the segmentation module is used for carrying out segmentation processing on the text to be identified to obtain a plurality of segmentation words;
the determining module is used for determining context information corresponding to each word in each word segmentation aiming at each word segmentation;
the generation module is used for generating a word vector sequence of the word according to the semantic relation in the context information, wherein the word vector sequence comprises the word vector of the word and the word vector of the word in the context information;
the first recognition module is used for inputting the word vector sequence of the segmented word into a recognition model which is obtained through training in advance, and obtaining a recognition result of whether the segmented word is a sensitive word or not.
Optionally, the apparatus further includes:
the preprocessing module is used for performing any one or more of the following preprocessing operations on the text to be recognized: character cleaning, full-width to half-width conversion, traditional-to-simplified Chinese conversion, pinyin-to-text conversion, merging of split characters, and homophone restoration, to obtain a preprocessed text;
The segmentation module is specifically configured to: and carrying out segmentation processing on the preprocessed text to obtain a plurality of segmentation words.
Optionally, the apparatus further includes:
the second recognition module is used for iteratively intercepting character strings with preset lengths from the text to be recognized; matching each intercepted character string with a pre-established dictionary tree; if there is a branch in the dictionary tree that matches the string, the string is identified as a sensitive word.
Optionally, the segmentation module includes:
the computing sub-module is used for computing mutual information between every two adjacent words in the text to be recognized, wherein the mutual information represents the association degree between the adjacent words;
the composition sub-module is used for composing two adjacent words corresponding to each piece of mutual information which is larger than a preset association threshold value into a candidate binary group;
and the segmentation module is used for calculating the information entropy of each candidate binary group, and carrying out segmentation processing on the text to be identified according to the calculated information entropy to obtain a plurality of segmentation words.
Optionally, the segmentation submodule includes:
the judging unit is used for judging, for each candidate binary group, whether the information entropy of the candidate binary group is larger than a preset probability threshold value; if so, triggering the first determining unit, and if not, triggering the second determining unit;
A first determining unit configured to determine the candidate binary group as a word segment;
the second determining unit is used for expanding the candidate binary groups to obtain a multi-group, and judging whether the information entropy of the multi-group is larger than the preset probability threshold value or not; if so, the multi-element group is determined to be a word.
Optionally, the information entropy includes left information entropy and right information entropy; the judging unit is specifically configured to: judging whether the left information entropy and the right information entropy of the candidate binary group are both larger than a preset probability threshold value;
the second determining unit is specifically configured to: under the condition that the left information entropy is not greater than the preset probability threshold, expanding the candidate binary group leftwards to obtain a left expanded multi-group, and judging whether the information entropy of the left expanded multi-group is greater than the preset probability threshold; determining the left expanded multi-tuple as a word segmentation under the condition that the information entropy of the left expanded multi-tuple is larger than the preset probability threshold; under the condition that the right information entropy is not greater than the preset probability threshold, expanding the candidate binary group rightward to obtain a right expanded multi-group, and judging whether the information entropy of the right expanded multi-group is greater than the preset probability threshold; and under the condition that the information entropy of the right expanded multi-element group is larger than the preset probability threshold value, determining the right expanded multi-element group as a word segmentation.
Optionally, the apparatus further includes:
the judging module is used for judging whether the length of the left expanded multi-element group reaches a preset length threshold value or not; if not, executing the step of judging whether the information entropy of the left expanded multi-element group is larger than the preset probability threshold value;
judging whether the length of the right expanded multi-element group reaches a preset length threshold value or not; and if not, executing the step of judging whether the information entropy of the right expanded multi-element group is larger than the preset probability threshold value.
Optionally, the determining module is specifically configured to:
for each word in the word segmentation, acquiring stroke information of the word;
performing characteristic numerical processing on the stroke information to obtain a multi-element characteristic sequence of the character;
and inputting the multi-element feature sequence of the word into a mapping model obtained through preset training to obtain the context information corresponding to the word.
Optionally, the generating module is specifically configured to:
determining a contextual semantic dependency relationship of the word segment, the dependency relationship comprising any one or more of: the preceding context, the following context, and the segment to which the word segment belongs;
and generating a word vector sequence of the segmented word according to the context semantic dependency relationship of the segmented word by using a long sequence coding algorithm.
Optionally, the apparatus further includes:
the model training module is used for obtaining training samples, and the training samples comprise: a word vector sequence consisting of word vectors of a plurality of continuous word segments and classification results corresponding to the plurality of continuous word segments; inputting the training sample into a classification model of a preset structure; recording order information of the word vector sequence by utilizing a sequence memory unit in the classification model; generating a prediction signal based on the sequence by the classification model through a mean value collection strategy; and iteratively adjusting parameters of the sequence memory unit based on the prediction signals and the classification results to obtain a recognition model after training.
In order to achieve the above object, an embodiment of the present invention further provides an electronic device, including a processor and a memory;
a memory for storing a computer program;
and the processor is used for realizing any sensitive word recognition method when executing the program stored in the memory.
To achieve the above object, an embodiment of the present invention further provides a computer readable storage medium, in which a computer program is stored, which when executed by a processor, implements any one of the above-mentioned sensitive word recognition methods.
When the embodiments of the invention are applied to sensitive word recognition, the context information corresponding to each word in a word segment is determined; a word vector sequence of the word segment is generated according to the semantic relations in the context information, the word vector sequence comprising the word vectors of the word segment and the word vectors of the words in its context; and the word vector sequence of the word segment is input into a pre-trained recognition model to obtain a recognition result of whether the word segment is a sensitive word. For a variant of a sensitive word, even if the character form or pronunciation changes, the contextual semantic dependency relationship remains unchanged; in this scheme the variant can therefore be identified based on the contextual semantic dependency relationship, improving the recognition effect.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a first flow of a method for recognizing a sensitive word according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a dictionary tree provided in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a word segmentation determining process according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of another word segmentation determining process according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an embedded neural network according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a classification model according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of an identification model according to an embodiment of the present invention;
FIG. 8 is a second flowchart of a method for recognizing a sensitive word according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a device for recognizing sensitive words according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In order to solve the above technical problems, embodiments of the present invention provide a method, an apparatus, and a device for recognizing a sensitive word, where the method and the apparatus may be applied to various electronic devices, and are not limited in particular. The method for identifying the sensitive words provided by the embodiment of the invention is first described in detail below. For convenience of description, the execution body will be referred to as an electronic device in the following embodiments.
Fig. 1 is a schematic flow chart of a method for identifying a sensitive word according to an embodiment of the present invention, including:
s101: and acquiring a text to be identified.
For example, the text to be identified may be text content published by the user in various scenes such as internet forums, personal homepages, game chat, etc. The text to be recognized may be chinese text, english text, japanese text, etc., and the specific language is not limited.
S102: and performing segmentation processing on the text to be identified to obtain a plurality of segmentation words.
As an embodiment, after S101, any one or more of the following preprocessing operations may first be performed on the text to be recognized: character cleaning, full-width to half-width conversion, traditional-to-simplified Chinese conversion, pinyin-to-text conversion, merging of split characters, and homophone restoration, to obtain a preprocessed text. In that case, S102 becomes: segmenting the preprocessed text to obtain a plurality of word segments.
For example, if the text to be recognized is Chinese text, character cleaning may be: removing characters other than Chinese characters, such as English letters, chat emoticons, and symbols like "#", "%", "@".
Full-width to half-width conversion: full-width characters are uniformly converted into half-width characters, e.g. the full-width string "ＡＢＣ１２３" is converted into "ABC123".
Traditional-to-simplified conversion: traditional Chinese characters are converted into simplified characters, e.g. the traditional form of "一万年" ("ten thousand years") is converted into its simplified form.
Pinyin-to-text conversion: pinyin letters are converted into the corresponding simplified characters, e.g. "yiwannian" is converted into "一万年" ("ten thousand years").
Split-character merging: if the text to be recognized is Chinese text, split character components are recombined into the normal Chinese character, e.g. a split form of "轮" ("wheel") is recombined into the single character.
Homophone restoration: homophone characters are restored to the normal Chinese characters, e.g. a homophonic spelling of "处理器" ("processor") is restored to that word.
If the pretreatment is performed in the above-mentioned various ways, the order of the various ways is not limited.
There may be sensitive-word variants in the text to be recognized that are similar in character form or pronunciation to the original sensitive words. For example, some users insert non-Chinese characters such as English letters, chat emoticons, or symbols like "#", "%", "@" between the characters of a sensitive word, or convert the sensitive word into full-width characters, traditional characters, pinyin, split characters, or homophones. Through the above preprocessing, such variants can be restored to the original sensitive words, reducing the interference of sensitive-word variants with recognition.
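As an illustration of this preprocessing, the following is a minimal Python sketch of two of the listed operations, full-width to half-width conversion and cleaning of non-Chinese symbols. The function names and the exact character ranges are assumptions made for this sketch, not the patent's implementation.

```python
import re

def fullwidth_to_halfwidth(text: str) -> str:
    """Convert full-width ASCII characters (U+FF01..U+FF5E) and the
    full-width space to their half-width equivalents."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:               # full-width space
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:
            code -= 0xFEE0               # shift into the ASCII range
        out.append(chr(code))
    return "".join(out)

def clean_characters(text: str) -> str:
    """Drop every character that is not a CJK ideograph (a simple reading
    of the 'character cleaning' step)."""
    return re.sub(r"[^\u4e00-\u9fff]", "", text)

print(fullwidth_to_halfwidth("ＡＢＣ１２３"))   # -> "ABC123"
print(clean_characters("处#理%器@"))            # -> "处理器"
```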
As an implementation manner, after S101, a character string with a preset length may be iteratively intercepted from the text to be recognized; matching each intercepted character string with a pre-established dictionary tree; if there is a branch in the dictionary tree that matches the string, the string is identified as a sensitive word.
For example, as shown in fig. 2, the dictionary tree may include a root node and a plurality of leaf nodes, where the root node may be null, or the root node may not include a character, and the leaf nodes include a character, where the characters from the root node to the end leaf nodes are connected into a character string, and the character string is a sensitive word. The root node to the end leaf node form branches of the dictionary tree, and it is understood that one branch corresponds to one character string, or one branch corresponds to one sensitive word. The character strings, dictionary tree structures, etc. in fig. 2 are merely examples, and are not limited to the dictionary tree structures.
Character strings of a preset length are iteratively intercepted from the text to be identified, where the preset length may be the maximum length of the sensitive words in the dictionary tree. For example, assuming the preset length is 5 characters and the text to be recognized is "we are good friends", the iteratively intercepted character strings are the successive 5-character substrings of the text, starting from the first character, then from the second character, and so on. The text to be identified is typically a longer piece of content; this is for illustration only.
The process of matching each cut character string with the dictionary tree is similar, and the following character string is exemplified:
the matching process is a process of traversing the dictionary tree, wherein the traversal starts from a root node, and each character in the character string is sequentially matched with each leaf node in each branch of the dictionary tree according to the direction in the dictionary tree (the direction from the root node to the tail leaf node); if the character in the character string is successfully matched with each leaf node in the branch, the branch is matched with the character string.
Specifically, if the number of characters in a string is greater than the number of leaf nodes in a branch, e.g. the string is "we are good friends" and one branch of the dictionary tree is "we", then the first two characters of the string match the two leaf nodes of the branch, in which case the branch is considered to match the string. If the string matches a branch in the dictionary tree, the string is a sensitive word.
It will be appreciated that in this embodiment, some general-purpose sensitive words are stored in advance, and are stored in the form of a dictionary tree. The dictionary tree structure uses the common prefix (empty root node), so that the expenditure of the query time can be reduced, and the matching efficiency can be improved. In this embodiment, the character strings in the text to be recognized are matched with the dictionary tree, that is, the general sensitive words in the text to be recognized are recognized first, and then the sensitive words in the rest of the content are recognized by using the scheme of S102-S104.
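To make the dictionary-tree matching concrete, the following Python sketch builds a simple trie from a list of sensitive words and checks whether any branch matches a prefix of an intercepted string. The class and function names are assumptions for illustration only.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.is_end = False  # True if a sensitive word ends at this node

class Trie:
    def __init__(self, words):
        self.root = TrieNode()          # empty (common-prefix) root node
        for w in words:
            node = self.root
            for ch in w:
                node = node.children.setdefault(ch, TrieNode())
            node.is_end = True

    def match_prefix(self, s: str) -> bool:
        """Return True if some branch (sensitive word) is a prefix of s."""
        node = self.root
        for ch in s:
            node = node.children.get(ch)
            if node is None:
                return False
            if node.is_end:
                return True
        return False

def scan(text: str, trie: Trie, window: int):
    """Iteratively intercept fixed-length strings and match them against the trie."""
    return [i for i in range(len(text)) if trie.match_prefix(text[i:i + window])]
```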
If any one or more of the above preprocessing is performed on the text to be identified by adopting the above embodiment, and a preprocessed text is obtained, the present embodiment may be adopted to iteratively intercept a character string with a preset length from the preprocessed text, and match the character string with a pre-established dictionary tree. The specific matching process is similar and will not be described again.
As an embodiment, S102 may include: calculating mutual information between every two adjacent words in the text to be identified, wherein the mutual information represents the association degree between the adjacent words; for each piece of mutual information greater than a preset association threshold, forming a candidate binary group by two adjacent words corresponding to the mutual information; and calculating the information entropy of each candidate binary group, and carrying out segmentation processing on the text to be identified according to the calculated information entropy to obtain a plurality of segmentation words.
For example, the mutual information between every two adjacent words in the text to be recognized can be calculated using the following formula:
PMI = p(x, y) / (p(x) · p(y)) = cnt(x, y) / (cnt(x) · cnt(y))    (Formula 1)
Wherein, the larger the PMI is, the stronger the relation degree between adjacent words is, p (x) and p (y) respectively represent the probability of occurrence of the event x and the event y, p (x, y) represents the probability of occurrence of the event x and the event y at the same time, and cnt represents a function of the statistical frequency.
The association threshold may be preset, and if the mutual information between two adjacent words is greater than the association threshold, it indicates that the association degree of the two adjacent words is stronger, and the two adjacent words are formed into a candidate binary group. The association threshold may be set according to practical situations, for example, may be 1, and specific numerical values are not limited. It can be understood that the association degree between each word in the sensitive word is strong, and candidate tuples with strong association degree can be screened out from the text to be identified, and the tuples consist of two words.
In this embodiment, the information entropy of each candidate binary group is calculated, where the information entropy characterizes the degree of randomness of the set of words adjacent to a text fragment on its left and on its right. For a given fragment, the more varied the words that appear adjacent to it on the left and right, the greater the probability that the fragment and its neighbors belong to different words. For example, the information entropy of a candidate binary group may be calculated using the following formula:
Entropy(U) = -Σ_{i=1}^{n} p_i · log(p_i)    (Formula 2)
where Entropy denotes the information entropy, U denotes the candidate binary group, i denotes the identifier of an adjacent word, p_i denotes the probability that the word identified by i occurs, and n denotes the number of adjacent words.
For example, take the phrase "eat grape but not spit grape skin" (a tongue twister in which the word "grape" appears four times): the left-adjacent words of "grape" are {eat, spit, eat, spit} and the right-adjacent words are {not, skin, spit, skin}. Calculated according to Formula 2, the information entropy of the left-adjacent words of "grape" is -(1/2)·log(1/2) - (1/2)·log(1/2), and the information entropy of the right-adjacent words is -(1/2)·log(1/2) - (1/4)·log(1/4) - (1/4)·log(1/4).
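The following Python sketch shows one way the mutual information of Formula 1 and the left/right information entropy of Formula 2 could be computed from raw counts. The function names, and the use of the natural logarithm, are assumptions of this sketch.

```python
import math
from collections import Counter

def pmi(text: str):
    """Mutual information for every adjacent character pair (Formula 1)."""
    char_cnt = Counter(text)
    pair_cnt = Counter(zip(text, text[1:]))
    return {p: pair_cnt[p] / (char_cnt[p[0]] * char_cnt[p[1]]) for p in pair_cnt}

def neighbor_entropy(text: str, fragment: str):
    """Left and right information entropy of a fragment (Formula 2)."""
    left, right = [], []
    start = text.find(fragment)
    while start != -1:
        if start > 0:
            left.append(text[start - 1])
        end = start + len(fragment)
        if end < len(text):
            right.append(text[end])
        start = text.find(fragment, start + 1)

    def entropy(neighbors):
        total = len(neighbors)
        if not total:
            return 0.0
        return -sum(c / total * math.log(c / total)
                    for c in Counter(neighbors).values())

    return entropy(left), entropy(right)
```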
As an embodiment, for each candidate binary group, it may be determined whether the information entropy of the candidate binary group is greater than a preset probability threshold: if so, the candidate binary group is determined to be a word segment; if not, either no further step is performed, or the candidate binary group may be expanded to obtain a multi-tuple and it is judged whether the information entropy of the multi-tuple is greater than the preset probability threshold; if so, the multi-tuple is determined to be a word segment.
The probability threshold may be set according to actual situations, and specific numerical values are not limited.
In this case, a length threshold may be set, where the length threshold represents the maximum length of the word, and the length threshold may be set according to the actual situation, for example, may be set to 5 characters, and the specific numerical value is not limited. In this case, when the candidate binary group is expanded, the length of the expanded multi-group does not exceed the length threshold.
In one case, the information entropy of the candidate binary group includes left information entropy and right information entropy, that is, the information entropy of the left neighbor and the information entropy of the right neighbor in the above example. In this case, reference may be made to fig. 3:
s301: judging whether the left information entropy and the right information entropy of the candidate binary group are both larger than a preset probability threshold, executing S302-S304 if the left information entropy is not larger than the preset probability threshold, executing S305-S307 if the right information entropy is not larger than the preset probability threshold, and executing S308 if the left information entropy and the right information entropy are both larger than the preset probability threshold.
S302: and expanding the candidate binary group leftwards to obtain a left expanded multi-tuple.
S303: judging whether the information entropy of the left expanded multi-element group is larger than a preset probability threshold value or not; if so, S304 is performed.
S304: the left expanded multi-tuple is determined to be a word segment.
S305: and expanding the candidate binary group rightward to obtain a right-expanded multi-group.
S306: judging whether the information entropy of the right expanded multi-element group is larger than a preset probability threshold value or not; if so, S307 is performed.
S307: the right expanded multi-tuple is determined to be a word segment.
S308: the candidate tuple is determined to be a word segment.
In this case, after the extended multi-tuple is obtained, it may be first determined whether the extended multi-tuple reaches a preset length threshold, if so, no subsequent step may be performed, and if not, it is further determined whether the information entropy of the extended multi-tuple is greater than a preset probability threshold.
For example, referring to fig. 4, fig. 4 is added with S309 after S302 on the basis of fig. 3: judging whether the length of the left expanded multi-element group reaches a preset length threshold value or not; if the determination result is not reached in S309, S303 is executed again. Similarly, S310 is added after S305: and judging whether the length of the right expanded multi-element group reaches a preset length threshold value, and executing S306 again if the judgment result of S310 is that the length of the right expanded multi-element group does not reach the preset length threshold value.
It can be understood that the sensitive word cannot be infinitely long, if the information entropy is not larger than the probability threshold after the sensitive word is expanded to a certain length, the sensitive word is not expanded any more, unnecessary calculation amount is reduced, and the recognition efficiency is improved.
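The expansion procedure of Figs. 3 and 4 can be sketched as follows in Python; the helper entropy_fn (e.g. the neighbor_entropy sketch above), the threshold values, and the stopping conditions are illustrative assumptions rather than the patent's exact logic.

```python
def decide_segment(text, pair, entropy_fn, prob_threshold, max_len):
    """pair: candidate binary group (a two-character string).
    entropy_fn(text, fragment) -> (left_entropy, right_entropy)."""
    left_e, right_e = entropy_fn(text, pair)
    segments = []

    if left_e > prob_threshold and right_e > prob_threshold:
        return [pair]                        # S308: the pair itself is a word segment

    if left_e <= prob_threshold:             # S302: expand to the left
        frag = pair
        while len(frag) < max_len:           # S309: stop at the length threshold
            idx = text.find(frag)
            if idx <= 0:
                break
            frag = text[idx - 1] + frag
            l_e, _ = entropy_fn(text, frag)
            if l_e > prob_threshold:         # S303 / S304
                segments.append(frag)
                break

    if right_e <= prob_threshold:            # S305: expand to the right
        frag = pair
        while len(frag) < max_len:           # S310: stop at the length threshold
            idx = text.find(frag)
            if idx == -1 or idx + len(frag) >= len(text):
                break
            frag = frag + text[idx + len(frag)]
            _, r_e = entropy_fn(text, frag)
            if r_e > prob_threshold:         # S306 / S307
                segments.append(frag)
                break

    return segments
```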
If the above embodiment is adopted to perform any one or more of the above preprocessing on the text to be identified to obtain the preprocessed text, the preprocessing text may be subjected to the segmentation processing according to the embodiment, and the specific segmentation process is similar and will not be repeated.
Or, other segmentation methods can be adopted, the text to be identified or the text after pretreatment is subjected to segmentation treatment, and the specific segmentation method is not limited.
S103: for each word segment, determining the context information corresponding to each word in the word segment.
The following description will be given by taking the processing of one word segmentation as an example: as an implementation manner, for each word in the word segmentation, stroke information of the word can be obtained; performing characteristic numerical processing on the stroke information to obtain a multi-element characteristic sequence of the character; and inputting the multi-element feature sequence of the word into a mapping model obtained through preset training to obtain the context information corresponding to the word.
For example, a word vector construction method based on the character characteristics of radicals, chinese character components and the like can be adopted, and the constructed word vector comprises context information corresponding to the word. These stroke features facilitate recognition of sensitive word variants that are close in character. Each word may be encoded based on stroke characteristics, e.g., chinese strokes may be divided into five categories:
Stroke name:  horizontal   vertical   left-falling   right-falling   hook
Shape:        一           丨         丿             ㇏              亅
ID:           1            2          3              4               5
Specifically, the text to be recognized can be divided into individual characters, and each character is then divided into strokes to obtain the stroke information of the character. Feature digitization of the stroke information can be understood as converting each stroke into an ID; the IDs are then combined in stroke order to obtain a multi-element feature sequence.
The multi-element feature sequence can be obtained with a sliding N-gram window. For example, after feature digitization of the strokes of a Chinese character, a sequence such as 1224443533 is obtained; its 3-gram feature (3 strokes) is 122, its 4-gram feature (4 strokes) is 1224, and so on. The "multi-element feature sequence" referred to here may be an N-gram feature sequence, i.e. an N-gram encoding of the character. Taking N = 5 as an example, the feature sequences up to 5-grams can be as follows:
122
224
...
1224
12244
...
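To illustrate the stroke digitization and the sliding N-gram window, here is a small Python sketch. The stroke-ID dictionary passed to the function is a placeholder supplied by the caller, and the example stroke sequence is made up.

```python
def stroke_ngrams(char: str, stroke_ids: dict, n_values=(3, 4, 5)):
    """stroke_ids maps a character to its stroke-ID string, e.g. "1224443533"
    (1 = horizontal, 2 = vertical, 3 = left-falling, 4 = right-falling, 5 = hook).
    Returns every n-gram of the stroke sequence for the given window sizes."""
    seq = stroke_ids[char]
    features = []
    for n in n_values:
        features += [seq[i:i + n] for i in range(len(seq) - n + 1)]
    return features

# Example with a made-up stroke sequence:
print(stroke_ngrams("某", {"某": "1224443533"}))
# -> ['122', '224', ..., '1224', ..., '12244', ...]
```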
And inputting the obtained multi-element feature sequence into a mapping model obtained by preset training to obtain a word vector carrying the context information.
For example, an embedded neural network of a preset structure may be trained to obtain the mapping model. Referring to Fig. 5, the embedded neural network and the mapping model may include an input layer, a hidden layer and an output layer; the dimension of the input layer may be V, the dimension of the hidden layer may be N, and the dimension of the output layer may be C × V, where C represents half of the context length. Taking one character as an example, the input layer receives the multi-element feature sequence x_k of the character, and the output layer outputs the word vectors carrying context information for x_k, namely {x_{k-C}, x_{k-C+1}, ..., x_{k-1}, x_{k+1}, ..., x_{k+C-1}, x_{k+C}}; the context length is 2C.
In Fig. 5 there is a weight matrix W_{V×N} between the input layer and the hidden layer. The i-th row of W_{V×N} represents the weight of the i-th word in a vocabulary, where the vocabulary contains common words and may be obtained from statistics over historical chat text. The weight matrix W_{V×N} contains the weight information of all words in the vocabulary, and the training process of W_{V×N} is the process of adjusting the parameters of the embedded neural network and the mapping model.
Between the hidden layer and the output layer there is an output weight matrix W'_{N×V} of dimension N × V. The hidden layer comprises N nodes, and the data obtained by weighted summation over the input layer is fed into the nodes of the hidden layer.
For example, the output layer may share weights: a multinomial distribution for the c-th context position is generated by a softmax function, giving the probability of each word in the vocabulary appearing at that context position given the selected word. Specifically, the probability may be calculated by Formula 3:
y_{c,j} := p(w_{c,j} = w_{O,c} | w_I) = exp(u_{c,j}) / Σ_{j'=1}^{V} exp(u_{j'})    (Formula 3)
where y_{c,j} denotes the predicted conditional probability distribution of the j-th word in the context; p(w_{c,j} = w_{O,c} | w_I) denotes, for a selected word w_I, the conditional probability that the j-th word in the context is w_{O,c}; and ":=" means "is equivalent to".
For example, given the phrase "this little cucumber" (in Chinese, "cucumber" is written literally as "yellow melon"), if the character "yellow" is selected, the context characters are "this", "little" and "melon"; the conditional probability of the 2nd context character given "yellow" is abbreviated in Formula 3 as y_("yellow", 2), i.e. j = 2 denotes the 2nd word of the context. The softmax function may also be expressed as an exp(·)/Σ exp(·) function. u_{c,j} denotes the vector obtained when the selected word c, taken as input, is transformed by the hidden layer, where the transformation from the input layer to the hidden layer is h = x^T · W = v_{w,I} and the transformation from the hidden layer to the output layer is u_{c,j} = h · u'_{w,j}.
The process of training the mapping model may include: obtaining a training text corpus, which need not include labeling information; inputting the corpus into the untrained embedded neural network; and iteratively learning the weights w using mathematical optimization algorithms such as back propagation and stochastic gradient descent. The weight update formula from the hidden layer to the output layer may refer to Formula 4, and the weight update formula from the input layer to the hidden layer may refer to Formula 5.
w'_(new) = w'_(old) - η · (y_{c,j} - t_{c,j}) · h_i    (Formula 4)
where w'_(new) denotes the updated weight from the hidden layer to the output layer, w'_(old) denotes the pre-update weight from the hidden layer to the output layer, η denotes the learning rate during training, y_{c,j} denotes the predicted conditional probability distribution of the j-th word in the context, t_{c,j} denotes the statistical frequency distribution of the j-th word in the context, and h_i denotes the i-th node of the hidden layer. By comparing y_{c,j} with t_{c,j}, the model can be continuously optimized so that the prediction probability output by the neural network continuously approaches the true statistical probability.
In Formula 5, w_(new) denotes the updated weight from the input layer to the hidden layer, w_{i,j} denotes the learning weight of the j-th word given the i-th word, w_(old) denotes the pre-update weight from the input layer to the hidden layer, η denotes the learning rate during training, y_{c,j} denotes the predicted conditional probability distribution of the j-th word in the context, t_{c,j} denotes the statistical frequency distribution of the j-th word in the context, h_i denotes the i-th node of the hidden layer, and V denotes the number of words in the corpus.
After the iteration converges, a word vector carrying context information can be calculated according to equation 3. For example, the dimension of the word vector may be 100 dimensions, or may be other, which is not limited in particular.
In the above example, the output layer shares its weights, which reduces the amount of computation and improves the generalization ability of the model.
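The forward pass of such an embedding network (input layer of dimension V, hidden layer of dimension N, shared-weight softmax output) can be sketched as follows in Python with NumPy. The randomly initialized matrices stand in for the trained weights and are for illustration only.

```python
import numpy as np

V, N = 5000, 100                              # vocabulary size and hidden dimension
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(V, N))       # input -> hidden weights, W_{V x N}
W_out = rng.normal(scale=0.01, size=(N, V))   # hidden -> output weights, W'_{N x V}

def forward(x_index: int):
    """One-hot input x_k; returns the softmax distribution that the
    shared-weight output layer produces for every context position."""
    h = W[x_index]                 # h = x^T W, the hidden representation
    u = h @ W_out                  # scores u_j for every word in the vocabulary
    y = np.exp(u - u.max())
    return y / y.sum()             # softmax, as in Formula 3

probs = forward(42)                # distribution over context words for word id 42
```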
S104: and generating a word vector sequence of the word segmentation according to the semantic relation in the context information. The word vector sequence comprises word vectors of the word and word vectors of the word in the context information.
As an embodiment, S104 may include: determining the contextual semantic dependency relationship of the word segment, the dependency relationship comprising any one or more of: the preceding context, the following context, and the segment to which the word segment belongs; and generating the word vector sequence of the word segment according to the contextual semantic dependency relationship of the word segment by using a long-sequence coding algorithm.
In this embodiment, a long-sequence encoding algorithm may be used, and the word vector sequence of the word segment may be generated based on the word vector carrying the context information obtained in S103 in combination with the forgetting factor.
Different orderings express different semantics. For example, assume that the text S contains T words and that the sentence sequence consisting of these T words is expressed as {x_1, x_2, x_3, ..., x_T}; the word vector carrying context information for the t-th word (1 ≤ t ≤ T) is denoted e_t. By calculating z_t (1 ≤ t ≤ T) in turn using Formula 6, the five front-and-back dependency relations of a word segment, namely its preceding context, its following context, the segment it belongs to, the combination of the preceding context and the segment, and the combination of the following context and the segment, can be obtained.
Here z_t denotes the encoding of the sequence from position 1 to position t, i.e. the front-and-back dependency of the sequence from position 1 to position t; α (0 < α < 1) denotes the forgetting factor, which may be a fixed value between 0 and 1, represents the effect of the preceding sequence on the current word, and reflects the word-order information of the word in the sequence.
The size of z_t is |V| and is independent of the length T of the original text S; that is, text of any indefinite length can be uniquely represented by a code of a specified length.
As can be seen from the above, based on the word vector carrying the context information obtained in S103, a word vector sequence of each segmented word of the text to be recognized can be generated by the above encoding process.
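A common form of forgetting-factor recurrence is z_t = α·z_{t-1} + e_t; the following Python sketch assumes that form, which may differ from the patent's Formula 6, to show how a variable-length sequence of word vectors can be summarized by a fixed-length code.

```python
import numpy as np

def encode_sequence(word_vectors, alpha=0.8):
    """word_vectors: list of equal-length NumPy vectors e_1..e_T.
    Returns z_T, assuming the recurrence z_t = alpha * z_{t-1} + e_t."""
    z = np.zeros_like(word_vectors[0])
    for e in word_vectors:
        z = alpha * z + e          # earlier vectors are progressively "forgotten"
    return z                       # fixed-size code, independent of T

vectors = [np.random.default_rng(i).normal(size=100) for i in range(6)]
code = encode_sequence(vectors)    # 100-dimensional code for a 6-word sequence
```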
S105: and inputting the word vector sequence of the segmented word into a recognition model which is obtained through training in advance, and obtaining a recognition result of whether the segmented word is a sensitive word.
As an embodiment, the recognition model may be obtained by training the following steps:
obtaining a training sample, the training sample comprising: a word vector sequence consisting of word vectors of a plurality of continuous word segments and classification results corresponding to the plurality of continuous word segments;
inputting the training sample into a classification model of a preset structure;
recording order information of the word vector sequence by utilizing a sequence memory unit in the classification model;
generating a prediction signal based on the sequence by the classification model through a mean value collection strategy;
and iteratively adjusting parameters of the sequence memory unit based on the prediction signals and the classification results to obtain a recognition model after training.
In this embodiment, the recognition model is obtained by training a classification model of a preset structure. The classification model may be a logistic regression-based training model. The logical structure of the classification model may include a plurality of sequence memory units, as shown in fig. 6, which may record order information of the word vector sequence. The training process is that the parameter of the sequence memory unit is adjusted iteratively.
As illustrated with reference to Fig. 6, the training samples may include multiple vectors. Taking one vector [y, x_{i-c}, ..., x_{i+c}] as an example, x_{i-c}, ..., x_{i+c} is the word vector sequence formed by the word vectors of successive word segments, where x_i is the word vector of the labeled word segment, x_{i-c}, ..., x_{i-1} are the word vectors of the c word segments before the labeled word segment, and x_{i+1}, ..., x_{i+c} are the word vectors of the c word segments after the labeled word segment; y is the classification result of the labeled word segment, e.g. y is 1 if the word segment is labeled as a sensitive word and 0 if it is labeled as non-sensitive.
The training logic is described with reference to Fig. 6: the sequence memory unit records the order information of the word vector sequence, and a prediction signal h is then generated from the sequence through a mean value collection (mean pooling) strategy. The difference between the prediction signal h and the classification result y is the loss of the classification model; the model training process is the process of minimizing this loss through feedback adjustment, in which the loss is fed back to the sequence memory unit and the parameters of the sequence memory unit are iteratively adjusted based on the loss until training is completed, yielding the recognition model. The process of model training can also be understood as solving min(y, h), e.g. by finding the optimal solution with a back propagation algorithm. The recognition model can be understood as a set of weight parameters from which the prediction value h is calculated, i.e. from which the classification decision is made.
As an embodiment, after the training sample is input into the classification model, the training sample may first undergo an input transformation. Assume that the input to the classification model at time t is the characteristic signal x_t; the transformation formula may be Formula 7:
c_in_t = tanh(W_xc · x_t + W_hc · h_{t-1} + b_c_in)    (Formula 7)
where W_xc and W_hc denote weight matrices, b_c_in denotes a bias vector, tanh(·) denotes the hyperbolic tangent transformation, c_in_t denotes the transformed signal of the input characteristic signal x_t at time t, and h_{t-1} denotes the output prediction signal for the input characteristic signal x_{t-1} at time t-1.
The prediction signal at time t is related to its input characteristic signal x_t and to the output prediction signal h_{t-1} produced for the input characteristic signal x_{t-1} at time t-1.
Memory gating can be included in the classification model; it can model the context dependencies of the text and can reduce the interference of other factors with the gradients inside the model during training. Specifically, memory gating may memorize sequence order information according to Formula 8:
i_t = g(W_xi · x_t + W_hi · h_{t-1} + b_i)
f_t = g(W_xf · x_t + W_hf · h_{t-1} + b_f)
o_t = g(W_xo · x_t + W_ho · h_{t-1} + b_o)    (Formula 8)
where W_xi, W_hi, W_xf, W_hf, W_xo and W_ho denote weight matrices, b_i, b_f and b_o denote bias vectors, and g(·) denotes the activation function; in particular, tanh(·) may be used as the activation function. i_t denotes the input gate, which measures how much information is memorized; f_t denotes the forget gate, which measures how much information is forgotten; and o_t denotes the output gate, which measures how much information is retained.
In the present embodiment, the output of the classification model may also be transformed. Assume the output of the model at time t is the prediction signal h_t; the transformation formula may be Formula 9:
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c_in_t
h_t = o_t ⊙ tanh(c_t)    (Formula 9)
where ⊙ denotes element-wise multiplication, c_t denotes the internal memory state for the characteristic signal at time t, i_t denotes the input gate, which measures how much information is memorized, f_t denotes the forget gate, which measures how much information is forgotten, and o_t denotes the output gate, which measures how much information is retained.
It can be seen that Formula 9 fuses the result c_in_t of the input transformation with the results i_t, f_t and o_t of the memory gating.
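Formulas 7 to 9 together describe a gated memory unit that closely resembles a standard LSTM cell. The following Python sketch implements that reading, using tanh for the gates as the text suggests; all weights are randomly initialized and the class name is an assumption made for illustration.

```python
import numpy as np

class MemoryUnit:
    """Gated sequence memory unit following Formulas 7-9 (an LSTM-like cell)."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        def mat(rows, cols):
            return rng.normal(scale=0.1, size=(rows, cols))
        # one (W_x*, W_h*, bias) triple per gate / input transform
        self.params = {name: (mat(hidden_dim, input_dim),
                              mat(hidden_dim, hidden_dim),
                              np.zeros(hidden_dim))
                       for name in ("c_in", "i", "f", "o")}

    def step(self, x_t, h_prev, c_prev):
        def gate(name):
            Wx, Wh, b = self.params[name]
            return np.tanh(Wx @ x_t + Wh @ h_prev + b)
        c_in = gate("c_in")                              # Formula 7: input transformation
        i_t, f_t, o_t = gate("i"), gate("f"), gate("o")  # Formula 8: gates
        c_t = f_t * c_prev + i_t * c_in                  # Formula 9: fuse with memory
        h_t = o_t * np.tanh(c_t)                         # Formula 9: output signal
        return h_t, c_t
```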
The classification model is similar in structure to the recognition model; reference may be made to Fig. 7. Suppose the word vector sequence [q_{i-c}, ..., q_{i+c}] of the word segment to be recognized is obtained in S104, where q_{i-c}, ..., q_{i+c} is the word vector sequence formed by the word vectors of successive word segments, q_i is the word vector of the word segment to be recognized, q_{i-c}, ..., q_{i-1} are the word vectors of the c word segments before the word segment to be recognized, and q_{i+1}, ..., q_{i+c} are the word vectors of the c word segments after the word segment to be recognized.
The word vector sequence is input into the recognition model obtained through training and undergoes input transformation, memory gating and output transformation to obtain a signal h. For the processing of the word vector sequence by the recognition model, reference may be made to the input transformation, memory gating and output transformation parts above. A category judgment is then performed on the signal h to obtain the prediction probability q_θ(h). The specific category decision formula may refer to Formula 10:
z = 1 if q_θ(h) ≥ 0.5, otherwise z = 0    (Formula 10)
where θ denotes the model parameters obtained by logistic regression training. In Formula 10, when the prediction probability q_θ(h) is greater than or equal to 0.5, z is 1, indicating that the recognition result is that the word segment is a sensitive word. The value 0.5 is only an example threshold; the specific value is not limited.
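For the final category decision, a minimal sketch follows, assuming a logistic-regression output q_θ(h) = 1 / (1 + e^{-θ·h}) over the pooled signal h; the sigmoid form and the parameter vector theta are assumptions of this sketch, while the 0.5 threshold follows the text above.

```python
import numpy as np

def classify(h, theta, threshold=0.5):
    """h: pooled prediction signal; theta: logistic-regression parameters.
    Returns 1 (sensitive word) or 0 (not sensitive), per Formula 10."""
    q = 1.0 / (1.0 + np.exp(-(theta @ h)))   # prediction probability q_theta(h)
    return 1 if q >= threshold else 0
```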
After S105, the identified sensitive word may be added to a stored sensitive word stock. For example, in one embodiment, the general sensitive words are stored in the form of a dictionary tree in advance, and then the identified sensitive words may be added to the dictionary tree after S105.
By applying the embodiments of the invention, the context information corresponding to each word in a word segment is determined; a word vector sequence of the word segment is generated according to the semantic relations in the context information, the word vector sequence comprising the word vectors of the word segment and the word vectors of the words in its context; and the word vector sequence is input into a pre-trained recognition model to obtain a recognition result of whether the word segment is a sensitive word. For a variant of a sensitive word, even if the character form or pronunciation changes, the contextual semantic dependency relationship remains unchanged; the variant can therefore be identified based on the contextual semantic dependency relationship, improving the recognition effect.
To verify the effect of the embodiments of the present invention, the following experiment was performed: 10 days were randomly selected within one month, 5000 chat texts were randomly sampled each day, and sensitive word recognition was performed on the sampled texts. Two indexes, accuracy and coverage, were defined in the experiment to measure the recognition effect: accuracy is defined as the number of correctly identified samples divided by the total number of predicted samples, and coverage is defined as the number of correctly identified samples divided by the total number of labeled samples.
Experimental data show that the accuracy of the embodiment of the invention is 90.2% and the coverage rate is 85.2%. The standard deviation of the accuracy is 0.02, the standard deviation of the coverage rate is 0.14, and the performance is stable. When the traditional word recognition algorithm is applied, the accuracy rate at the initial stage is 80.1 percent, the coverage rate is 75.2 percent and the accuracy rate and the coverage rate can be obviously reduced along with the time due to the hysteresis of word stock updating; the standard deviation of the accuracy was 42.2 and the standard deviation of the coverage was 55.1. Therefore, the embodiment of the invention is obviously superior to the traditional algorithm in recognition accuracy and stability.
Fig. 8 is a second flowchart of a method for identifying a sensitive word according to an embodiment of the present invention, including:
S801: and acquiring a text to be identified.
For example, the text to be identified may be text content published by users in various scenarios such as internet forums, personal homepages and game chat. The text to be recognized may be Chinese text, English text, Japanese text, etc.; the specific language is not limited.
S802: preprocessing the text to be identified with any one or more of the following operations: character cleaning, full-width to half-width conversion, traditional-to-simplified conversion, pinyin-to-text conversion, split-character merging, and homophone restoration, to obtain a preprocessed text.
For example, if the text to be recognized is Chinese text, character cleaning may be: removing non-Chinese-character content such as English letters, chat emoticons, and special symbols such as "#", "-", "%", "@", and the like.
Full-width to half-width: full-width characters are uniformly converted into half-width characters, for example full-width "ＡＢＣ１２３" is converted into half-width "ABC123".
Traditional to simplified: traditional Chinese characters are converted into simplified characters, for example the traditional form of "ten thousand years" (一萬年) is converted into the simplified form (一万年).
Pinyin-to-text: pinyin letters are converted into the corresponding simplified characters, for example "yiwannian" is converted into "一万年" (ten thousand years).
Split-character merging: if the text to be recognized is Chinese text, characters that have been written as split components are recombined into the normal Chinese character, for example the split components of "wheel" (车 + 仑) are combined back into the single character "轮".
Homophone restoration: homophonic characters are restored to the normal Chinese characters, for example a homophonic spelling of "processor" is restored to "processor".
When several of the above preprocessing operations are applied, the order in which they are performed is not limited.
Sensitive word variants that are similar in shape or in pronunciation may exist in the text to be recognized. For example, some users insert non-Chinese characters, English letters, chat emoticons, or special symbols such as "#", "%", "@" between the characters of a sensitive word, or convert the sensitive word into full-width characters, traditional characters, pinyin, split characters, homophonic characters, and so on. Through the above preprocessing, the variants of some sensitive words can be restored to the sensitive words themselves, thereby reducing the interference of such variants on recognition.
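To make the preprocessing step concrete, the following is a minimal Python sketch of such a pipeline. It assumes caller-supplied mapping tables for pinyin-to-text and traditional-to-simplified conversion (split-character merging and homophone restoration would be handled by similar tables); the function name and the regular expression are illustrative, not part of the original scheme.

```python
import re
import unicodedata

def preprocess(text, trad2simp=None, pinyin2word=None):
    """Minimal preprocessing sketch: full-width to half-width conversion,
    table-based replacements, then character cleaning."""
    # Full-width to half-width: NFKC normalization maps "ＡＢＣ１２３" to "ABC123".
    text = unicodedata.normalize("NFKC", text)
    # Pinyin-to-text and traditional-to-simplified via caller-supplied tables
    # (applied before cleaning so that the Latin pinyin is still present).
    for table in (pinyin2word or {}, trad2simp or {}):
        for src, dst in table.items():
            text = text.replace(src, dst)
    # Character cleaning: keep only CJK characters, dropping leftover letters,
    # digits, chat emoticons and symbols such as '#', '-', '@'.
    return re.sub(r"[^\u4e00-\u9fff]", "", text)
```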
S803: iteratively intercepting character strings with preset lengths from the preprocessed text.
S804: for each character string intercepted, the character string is matched with a pre-established dictionary tree, and whether a branch matched with the character string exists in the dictionary tree is judged, and if so, S805 is executed.
S805: and identifying the character string as a general sensitive word, and determining the preprocessed text which does not comprise the general sensitive word as a text to be processed.
For example, as shown in fig. 2, the dictionary tree may include a root node and a plurality of leaf nodes. The root node may be null, that is, it contains no character, while each leaf node contains one character; the characters from the root node to a tail leaf node are connected into a character string, and that character string is a sensitive word. The path from the root node to a tail leaf node forms a branch of the dictionary tree, so one branch corresponds to one character string, i.e., one sensitive word. The character strings and the dictionary tree structure in fig. 2 are merely examples and do not limit the dictionary tree structure.
Character strings of a preset length are iteratively intercepted from the preprocessed text, where the preset length may be the maximum length of the sensitive words in the dictionary tree. For example, assuming the preset length is 5 characters and the preprocessed text is "we are good friends", the iteratively intercepted character strings are the successive 5-character windows of that text. This is merely an example; the preprocessed text is typically a longer piece of content.
The matching process is similar for every intercepted character string; one character string is taken as an example below:
The matching process is a process of traversing the dictionary tree, wherein the traversal starts from a root node, and each character in the character string is sequentially matched with each leaf node in each branch of the dictionary tree according to the direction in the dictionary tree (the direction from the root node to the tail leaf node); if the character in the character string is successfully matched with each leaf node in the branch, the branch is matched with the character string.
Specifically, if the number of characters in the character string is greater than the number of leaf nodes in a branch — for example, the character string is "we are good" and one branch of the dictionary tree is "we" — then the first two characters of the character string successfully match the two leaf nodes of the branch, and the branch is regarded as matching the character string. If a character string matches some branch in the dictionary tree, the character string is a sensitive word.
In this embodiment, some general sensitive words are stored in advance, and these general sensitive words are stored in the form of a dictionary tree. The dictionary tree structure uses the common prefix (empty root node), so that the expenditure of the query time can be reduced, and the matching efficiency can be improved.
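As an illustration of S803–S805, the following Python sketch builds a dictionary tree with an empty root node and matches fixed-length windows of the preprocessed text against it; the class and function names are assumptions for illustration, not the patent's implementation.

```python
class TrieNode:
    __slots__ = ("children", "is_end")
    def __init__(self):
        self.children = {}
        self.is_end = False   # marks the tail leaf node of a sensitive word

def build_trie(words):
    """Store general sensitive words under a shared empty root node."""
    root = TrieNode()
    for word in words:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root

def match_prefix(root, string):
    """Traverse from the root; return the first sensitive word that is a
    prefix of `string`, or None if no branch matches."""
    node = root
    for i, ch in enumerate(string):
        if ch not in node.children:
            return None
        node = node.children[ch]
        if node.is_end:
            return string[: i + 1]
    return None

def find_general_sensitive_words(text, root, max_len=5):
    """Iteratively intercept fixed-length windows (S803) and match each one
    against the dictionary tree (S804-S805)."""
    hits = []
    for start in range(len(text)):
        hit = match_prefix(root, text[start : start + max_len])
        if hit:
            hits.append((start, hit))
    return hits
```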
In this embodiment, the character strings in the preprocessed text are matched with the dictionary tree, that is, the universal sensitive words in the preprocessed text are first identified. Under the condition, the universal sensitive words identified in the step S805 can be removed from the preprocessed text, the rest text is used as the text to be processed, and the subsequent steps are continuously executed for the text to be processed, so that repeated identification of the identified sensitive words can be avoided, unnecessary calculation amount is reduced, and calculation efficiency is improved.
S806: and performing segmentation processing on the text to be processed to obtain a plurality of segmentation words.
As an embodiment, S806 may include: calculating mutual information between every two adjacent words in the text to be processed, wherein the mutual information represents the association degree between the adjacent words; for each piece of mutual information greater than a preset association threshold, forming a candidate binary group by two adjacent words corresponding to the mutual information; and calculating the information entropy of each candidate binary group, and carrying out segmentation processing on the text to be processed according to the calculated information entropy to obtain a plurality of segmentation words.
For example, the mutual information between every two adjacent words in the text to be processed can be calculated using the following formula:
PMI = p(x, y) / (p(x) · p(y)) = cnt(x, y) / (cnt(x) · cnt(y))    (formula 1)

wherein the larger the PMI, the stronger the association between the adjacent words; p(x) and p(y) respectively represent the probability of occurrence of the event x and the event y, p(x, y) represents the probability that x and y occur at the same time, and cnt(·) represents a frequency-counting function.
The association threshold may be preset, and if the mutual information between two adjacent words is greater than the association threshold, it indicates that the association degree of the two adjacent words is stronger, and the two adjacent words are formed into a candidate binary group. The association threshold may be set according to practical situations, for example, may be 1, and specific numerical values are not limited. It can be understood that the association degree between each word in the sensitive word is strong, and candidate tuples with strong association degree can be screened out from the text to be processed, and the tuples consist of two words.
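A minimal sketch of formula 1 and the candidate binary group screening, assuming the association is estimated from frequency counts over the text to be processed; the threshold value of 1 follows the example above, and the function name is illustrative.

```python
from collections import Counter

def candidate_bigrams(text, pmi_threshold=1.0):
    """Formula 1: association of adjacent words measured as
    p(x, y) / (p(x) * p(y)), estimated from frequency counts."""
    n = len(text)
    char_cnt = Counter(text)                    # cnt(x)
    pair_cnt = Counter(zip(text, text[1:]))     # cnt(x, y) for adjacent words
    candidates = []
    for (x, y), cxy in pair_cnt.items():
        p_xy = cxy / (n - 1)
        p_x, p_y = char_cnt[x] / n, char_cnt[y] / n
        if p_xy / (p_x * p_y) > pmi_threshold:  # keep strongly associated pairs
            candidates.append((x, y))
    return candidates
```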
In this embodiment, the information entropy of each candidate binary group is calculated, where the information entropy represents the degree of randomness of the set of left neighbors and the set of right neighbors of a text fragment. For a given fragment, the more varied the words appearing to its left and right, the greater the probability that the fragment and its neighboring words belong to different words. For example, the information entropy of a candidate binary group may be calculated using the following formula:
Entropy(U) = -Σ_{i=1}^{n} p_i · log(p_i)    (formula 2)

wherein Entropy represents the information entropy, U represents the candidate binary group, i is the identification of a neighboring word, p_i represents the probability of occurrence of the word identified as i, and n represents the number of adjacent words.
For example, taking the sentence "eat grape but not spit grape skin" as an example, the word "grape" appears four times; its left-neighbor words are {eat, spit, eat, spit} and its right-neighbor words are {do not, skin, spit, skin}. Calculated according to formula 2, the information entropy of the left neighbors of "grape" is -(1/2)·log(1/2) - (1/2)·log(1/2), and the information entropy of the right neighbors is -(1/2)·log(1/2) - (1/4)·log(1/4) - (1/4)·log(1/4).
As an embodiment, for each candidate binary group, it may be determined whether the information entropy of the candidate binary group is greater than a preset probability threshold: if so, the candidate binary group is determined to be a word segment; if not, the candidate binary group may be expanded to obtain a multi-tuple, and it is then determined whether the information entropy of the multi-tuple is greater than the preset probability threshold; if so, the multi-tuple is determined to be a word segment.
The probability threshold may be set according to actual situations, and specific numerical values are not limited.
In this case, a length threshold may be set, where the length threshold represents the maximum length of the word, and the length threshold may be set according to the actual situation, for example, may be set to 5 characters, and the specific numerical value is not limited. In this case, when the candidate binary group is expanded, the length of the expanded multi-group does not exceed the length threshold.
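The entropy computation of formula 2 and the threshold test described above might look like the following sketch; the expansion of a rejected candidate into a left- or right-expanded multi-tuple is left to the caller, and the helper names and default values are illustrative.

```python
import math
from collections import Counter

def entropy(neighbors):
    """Formula 2: -sum(p_i * log(p_i)) over the neighboring-word distribution."""
    if not neighbors:
        return 0.0
    total = len(neighbors)
    return -sum((c / total) * math.log(c / total)
                for c in Counter(neighbors).values())

def left_right_entropy(text, fragment):
    """Collect the left/right neighbors of every occurrence of `fragment`."""
    left, right, start = [], [], 0
    while True:
        idx = text.find(fragment, start)
        if idx == -1:
            break
        if idx > 0:
            left.append(text[idx - 1])
        end = idx + len(fragment)
        if end < len(text):
            right.append(text[end])
        start = idx + 1
    return entropy(left), entropy(right)

def accept_fragment(text, fragment, prob_threshold, length_threshold=5):
    """Accept a candidate as a word segment if both entropies exceed the
    probability threshold and the fragment is within the length threshold;
    otherwise the caller would try the left/right expanded multi-tuple next."""
    if len(fragment) > length_threshold:
        return None
    left_e, right_e = left_right_entropy(text, fragment)
    if left_e > prob_threshold and right_e > prob_threshold:
        return fragment
    return None
```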
S807: for each word segment, determining the context information corresponding to each word in the word segment.
The following description will be given by taking the processing of one word segmentation as an example: as an implementation manner, for each word in the word segmentation, stroke information of the word can be obtained; performing characteristic numerical processing on the stroke information to obtain a multi-element characteristic sequence of the character; and inputting the multi-element feature sequence of the word into a mapping model obtained through preset training to obtain the context information corresponding to the word.
For example, a word vector construction method based on character features such as radicals and Chinese character components can be adopted, and the constructed word vector carries the context information corresponding to the word. These stroke features facilitate recognition of sensitive word variants that are close in character shape. Each word may be encoded based on stroke features; for example, Chinese strokes may be divided into five categories:
Stroke name: transverse/horizontal (一) — ID 1; vertical (丨) — ID 2; skimming/left-falling (丿) — ID 3; right-falling (㇏) — ID 4; hook/turn — ID 5.
Specifically, the text to be recognized can be divided into individual words, and then the words are divided into strokes, so that the stroke information of the words is obtained. Feature-digitizing the stroke information may be understood as converting the stroke to an ID. And then combining the IDs according to the stroke order to obtain a multi-element characteristic sequence.
The multi-element feature sequence can be obtained by sliding an N-element window. For example, after feature numerical processing of the strokes of a Chinese character such as "wandering", the sequence 1224443533 is obtained; a 3-element feature sequence (3 strokes) is 122, a 4-element feature sequence (4 strokes) is 1224, and so on. The "multi-element feature sequence" here may be an N-element feature sequence, i.e., an N-element encoding of the word. Taking N = 5 as an example, the extracted feature sequences may be as follows (a small sketch of this window sliding follows the list):
122
224
...
1224
12244
...
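The N-element window sliding described above can be sketched as follows, using the stroke-ID sequence 1224443533 of the example character; the window sizes 3 to 5 are the illustrative choice from the text.

```python
def stroke_ngrams(stroke_ids, n_min=3, n_max=5):
    """Slide an N-element window over a character's stroke-ID sequence and
    collect every N-gram as a feature string."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(stroke_ids) - n + 1):
            grams.append("".join(str(s) for s in stroke_ids[i:i + n]))
    return grams

# With the stroke-ID sequence 1224443533 from the example:
print(stroke_ngrams([1, 2, 2, 4, 4, 4, 3, 5, 3, 3])[:4])
# ['122', '224', '244', '444']
```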
and inputting the obtained multi-element feature sequence into a mapping model obtained by preset training to obtain a word vector carrying the context information.
For example, an embedded neural network of a preset structure may be trained to obtain the mapping model. Referring to fig. 5, the embedded neural network and the mapping model may include an input layer, a hidden layer, and an output layer; the input layer has dimension V, the hidden layer has dimension N, the output layer has dimension C×V, and C represents half of the context length. Taking a word x_k as an example, the input layer takes the multi-element feature sequence of x_k as input, and the output layer outputs the word vector of x_k carrying the context information {x_{k-C}, ..., x_{k-1}, x_{k+1}, ..., x_{k+C}}; the context length is 2C.
In fig. 5, a weight matrix W_{V×N} exists between the input layer and the hidden layer. The i-th row of W_{V×N} represents the weight of the i-th word in a vocabulary; the vocabulary contains common words and can be obtained from statistics on historical chat text. The weight matrix W_{V×N} contains the weight information of all words in the vocabulary, and the training process is the process of adjusting these parameters of the embedded neural network and the mapping model.
Between the hidden layer and the output layer there is an output weight matrix W' of dimension N×V. The hidden layer contains N nodes, and the weighted sum of the input layer is fed to the nodes of the hidden layer.
For example, the output layer may share weights, generating through a softmax function a multinomial distribution for each of the context positions; the distribution gives the probability of each vocabulary word appearing at that context position together with the selected word. Specifically, the probability may be calculated by formula 3:

y_{c,j} = p(w_{c,j} = w_{O,c} | w_I) = exp(u_{c,j}) / Σ_{j'=1}^{V} exp(u_{j'})    (formula 3)

wherein y_{c,j} represents the predicted conditional probability of the j-th word at the c-th context position; p(w_{c,j} = w_{O,c} | w_I) represents, for a selected word w_I, the conditional probability that the word at the c-th context position is the j-th word; and ":=" means "equivalent to". For example, for the phrase "this cucumber" with the selected word "yellow" (the first character of "cucumber" in Chinese), the context words are "this", the measure word, and "melon"; the conditional probability that the 2nd context word is the measure word given "yellow" is abbreviated in formula 3 with j = 2, representing the 2nd word in the context. The softmax function may also be expressed as an exp(·)/Σexp(·) function. u_{c,j} is the score of the j-th word at the c-th context position, obtained after the input vector of the selected word is transformed by the hidden layer: the transformation from the input layer to the hidden layer is h = x^T · W = v_{w,I}, and the transformation from the hidden layer to the output layer is u_{c,j} = h · u'_{w,j}.
The process of training to obtain the mapping model may include: obtaining a training text corpus (which need not include labeling information), inputting the corpus into an untrained embedded neural network, and iteratively learning the weights using mathematical optimization algorithms such as back propagation and stochastic gradient descent; the weight-update iteration formula from the hidden layer to the output layer can refer to formula 4, and the weight-update iteration formula from the input layer to the hidden layer can refer to formula 5.
w'^{(new)}_{i,j} = w'^{(old)}_{i,j} - η · (y_{c,j} - t_{c,j}) · h_i    (formula 4)

wherein w'^{(new)} represents the updated weight from the hidden layer to the output layer, w'^{(old)} represents the weight before the update, η represents the learning rate during training, y_{c,j} represents the predicted conditional probability distribution of the j-th word in the context, t_{c,j} represents the statistical frequency probability distribution of the j-th word in the context, and h_i represents the i-th node of the hidden layer. By calculating the difference between y_{c,j} and t_{c,j}, the model can be continuously optimized so that the prediction probability output by the neural network approaches the true statistical probability.
w^{(new)}_{i} = w^{(old)}_{i} - η · Σ_{j=1}^{V} (y_{c,j} - t_{c,j}) · w'_{i,j}    (formula 5)

wherein w^{(new)} represents the updated weight from the input layer to the hidden layer, w_{i,j} represents the learning weight of the j-th word for the i-th given word, w^{(old)} represents the weight before the update, η represents the learning rate during training, y_{c,j} and t_{c,j} are as defined in formula 4, and V represents the number of words in the corpus.
After the iteration converges, a word vector carrying context information can be calculated according to formula 3. For example, the dimension of the word vector may be 100, or another value; the specific dimension is not limited.
In the above example, the output layer shares weights, which reduces the amount of calculation and improves the generalization ability of the model.
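For illustration only, the following NumPy sketch implements a simplified single-word version of formulas 3–5: a one-hot input selects the hidden vector h as one row of W, the shared output weights produce one softmax distribution for all context positions, and the error y − t drives both weight updates. The vocabulary size, hidden size, learning rate and example indices are assumptions, and the per-position sums of formulas 4–5 are collapsed into a single averaged target distribution.

```python
import numpy as np

V, N = 5000, 100                  # vocabulary size, hidden-layer size
rng = np.random.default_rng(0)
W  = rng.normal(0, 0.01, (V, N))  # input -> hidden weights (one row per word)
W2 = rng.normal(0, 0.01, (N, V))  # hidden -> output weights (shared by all positions)
eta = 0.05                        # learning rate

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()            # exp()/sum(exp()), as in formula 3

def train_step(center, context_ids):
    """One iteration for a (word, context) pair."""
    h = W[center]                                  # h = x^T W for a one-hot input
    y = softmax(h @ W2)                            # predicted distribution (formula 3)
    t = np.zeros(V)
    for j in context_ids:                          # averaged target frequency distribution
        t[j] += 1.0 / len(context_ids)
    e = y - t                                      # prediction error (y - t)
    W2[:] -= eta * np.outer(h, e)                  # hidden -> output update (formula 4)
    W[center] -= eta * (W2 @ e)                    # input -> hidden update (formula 5)

train_step(center=10, context_ids=[3, 7, 15, 42])  # illustrative word indices
word_vector = W[10]                                # word vector carrying context info
```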
S808: and generating a word vector sequence of the word segmentation according to the semantic relation in the context information. The word vector sequence comprises word vectors of the segmented words and word vectors of the segmented words in the context information.
As an embodiment, S808 may include: determining a context semantic dependency of the word, the dependency comprising any one or more of: the above, below, the fragments to which they pertain; and generating a word vector sequence of the segmented word according to the context semantic dependency relationship of the segmented word by using a long sequence coding algorithm.
In this embodiment, a long-sequence encoding algorithm may be used, and based on the word vector carrying the context information obtained in S807, the word vector sequence of the segmented word may be generated in combination with the forgetting factor.
Different orderings express different semantics. For example, assume that the text S contains T words and that the sentence sequence consisting of these T words is expressed as {x_1, x_2, x_3, ..., x_T}; the word vector carrying context information for the t-th word (1 ≤ t ≤ T) is denoted as e_t. By calculating z_t (1 ≤ t ≤ T) in turn with formula 6, five front-back dependency relations can be obtained: the above, the below, the segment to which the word belongs, the combination of the above with the segment, and the combination of the below with the segment.

z_t = α · z_{t-1} + e_t,  z_0 = 0    (formula 6)

wherein z_t represents the encoding of the sequence from position 1 to position t, i.e., the front-back dependency of the sequence from position 1 to position t; α (0 < α < 1) represents the forgetting factor, which may be a fixed value between 0 and 1, characterizing the effect of the preceding sequence on the current word and reflecting the word-order information of the word in the sequence.
The size of z_t is |V|, and it is independent of the length T of the original text S; that is, any text of indefinite length can be uniquely represented by a code of a specified length.
As can be seen from the above, based on the word vector carrying the context information obtained in S807, a word vector sequence of each segmented word of the text to be recognized can be generated by the above encoding process.
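A minimal sketch of the encoding of formula 6, assuming z_0 = 0 and an illustrative forgetting factor of 0.5; word_vectors stands for the e_t sequence produced in S807.

```python
import numpy as np

def forgetting_encode(word_vectors, alpha=0.5):
    """Formula 6: z_t = alpha * z_{t-1} + e_t, with z_0 = 0.
    alpha (0 < alpha < 1) is the forgetting factor; 0.5 is illustrative."""
    z = np.zeros_like(word_vectors[0], dtype=float)
    for e_t in word_vectors:          # positions 1..T in order
        z = alpha * z + e_t           # earlier words decay geometrically
    return z                          # fixed-size code, independent of T

# Any text length maps to a code of the same dimension:
code = forgetting_encode([np.ones(100)] * 7)
```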
S809: and inputting the word vector sequence of the segmented word into a recognition model which is obtained through training in advance, and obtaining a recognition result of whether the segmented word is a sensitive word.
As an embodiment, the recognition model may be obtained by training the following steps:
obtaining a training sample, the training sample comprising: a word vector sequence consisting of word vectors of a plurality of continuous word segments and classification results corresponding to the plurality of continuous word segments;
inputting the training sample into a classification model of a preset structure;
recording order information of the word vector sequence by utilizing a sequence memory unit in the classification model;
generating a prediction signal based on the sequence by the classification model through a mean value collection strategy;
and iteratively adjusting parameters of the sequence memory unit based on the prediction signals and the classification results to obtain a recognition model after training.
In this embodiment, the recognition model is obtained by training a classification model of a preset structure. The classification model may be a logistic regression-based training model. The logical structure of the classification model may include a plurality of sequence memory units, as shown in fig. 6, which may record order information of the word vector sequence. The training process is that the parameter of the sequence memory unit is adjusted iteratively.
As illustrated with reference to fig. 6, the training samples may include multiple vectors. Taking one vector [y, x_{i-c}, ..., x_{i+c}] as an example, x_{i-c}, ..., x_{i+c} represents a word vector sequence consisting of the word vectors of consecutive word segments, where x_i represents the word vector of the labeled word segment, x_{i-c}, ..., x_{i-1} represent the word vectors of the c word segments before the labeled word segment, and x_{i+1}, ..., x_{i+c} represent the word vectors of the c word segments after it; y represents the classification result corresponding to the labeled word segment, for example y is 1 if the word segment is labeled as a sensitive word and 0 if it is labeled as a non-sensitive word.
Referring to fig. 6, the training logic is as follows: the sequence memory units record the order information of the word vector sequence, and a prediction signal h is then generated from the sequence through a mean value collection strategy. The difference between the prediction signal h and the classification result y is the loss of the classification model; model training is the process of minimizing this loss through feedback adjustment. The loss is fed back to the sequence memory units, whose parameters are iteratively adjusted based on the loss until training is completed, yielding the recognition model. The training process can also be understood as minimizing the difference between y and h, for example by finding the optimal solution with a back propagation algorithm. The recognition model can be understood as a set of weight parameters from which the prediction value h is calculated, i.e., with which the classification decision is made.
As an embodiment, after the training sample is input into the classification model, the training sample may first be input-transformed. Assume that the input of the classification model at time t is the feature signal x_t; the transformation may refer to formula 7:

c_in_t = tanh(W_xc · x_t + W_hc · h_{t-1} + b_c_in)    (formula 7)

wherein W_xc and W_hc represent weight matrices, b_c_in represents a bias vector, tanh(·) represents the hyperbolic tangent transformation, c_in_t represents the transformed signal of the input feature signal x_t at time t, and h_{t-1} represents the output signal for the input feature signal x_{t-1} at time t-1.
That is, the prediction signal at time t is related to the corresponding input feature signal x_t and to the prediction signal h_{t-1} output for the input feature signal x_{t-1} at time t-1.
Memory gating can be included in the classification model; it can model the context dependency of the text and reduce the interference of other factors on the gradients inside the model during training. Specifically, memory gating may memorize the sequence order information through formula 8:

i_t = g(W_xi · x_t + W_hi · h_{t-1} + b_i)
f_t = g(W_xf · x_t + W_hf · h_{t-1} + b_f)
o_t = g(W_xo · x_t + W_ho · h_{t-1} + b_o)    (formula 8)

wherein W_xi, W_hi, W_xf, W_hf, W_xo and W_ho represent weight matrices, b_i, b_f and b_o represent bias vectors, and g(·) represents the activation function; in particular, tanh(·) may be used as the activation function. i_t represents the input gating, used to measure how much information is memorized; f_t represents the forget gating, used to measure how much information is forgotten; o_t represents the output gating, used to measure how much information is retained.
In this embodiment, the output of the classification model may also be transformed. Assume that the output of the model at time t is the prediction signal h_t; the transformation may refer to formula 9:

c_t = f_t ⊙ c_{t-1} + i_t ⊙ c_in_t
h_t = o_t ⊙ tanh(c_t)    (formula 9)

wherein c_t represents the memory state at time t, obtained by fusing the transformed input signal c_in_t with the previous state c_{t-1}; i_t represents the input gating, used to measure how much information is memorized; f_t represents the forget gating, used to measure how much information is forgotten; o_t represents the output gating, used to measure how much information is retained; and ⊙ denotes element-wise multiplication.
It can be seen that formula 9 fuses the result c_in_t of the input transformation with the gating results i_t, f_t and o_t of the memory gating.
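For illustration, one step of such a sequence memory unit combining formulas 7–9 can be sketched as follows. The parameter names mirror the weight matrices and bias vectors above; using the sigmoid for the gate activation g(·) is an assumption of this sketch (the text mentions tanh as one possible choice).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def memory_cell_step(x_t, h_prev, c_prev, p):
    """One sequence-memory-unit step.
    `p` is a dict of weight matrices / bias vectors (W_xc, W_hc, b_c, W_xi, ...)."""
    c_in = np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])    # formula 7
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["b_i"])     # input gate
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["b_f"])     # forget gate
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["b_o"])     # output gate (formula 8)
    c_t = f_t * c_prev + i_t * c_in                                    # fuse gates and input (formula 9)
    h_t = o_t * np.tanh(c_t)                                           # prediction signal at time t
    return h_t, c_t
```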
The classification model and the recognition model are similar in structure; reference may be made to fig. 7. Assume that the word vector sequence [q_{i-c}, ..., q_{i+c}] of the word segment to be recognized is obtained in S808, where q_{i-c}, ..., q_{i+c} represents the word vector sequence consisting of the word vectors of consecutive word segments, q_i represents the word vector of the word segment to be recognized, q_{i-c}, ..., q_{i-1} represent the word vectors of the c word segments before the word segment to be recognized, and q_{i+1}, ..., q_{i+c} represent the word vectors of the c word segments after it.
The word vector sequence is input into the recognition model obtained through training, and is subjected to input transformation, memory gating, output transformation and the like to obtain a signal h. The processing of the word vector sequence by the recognition model can refer to the input transformation, memory gating and output transformation parts above. Class judgment is then carried out on the signal h to obtain the prediction probability q_θ(h). The specific category decision may refer to formula 10:

z = 1 if q_θ(h) ≥ 0.5, otherwise z = 0    (formula 10)

where θ represents the model parameters obtained by logistic regression training. In formula 10, when the prediction probability q_θ(h) is greater than or equal to 0.5, z is 1, which indicates that the recognition result is: the word segment is a sensitive word. The value 0.5 is merely an example threshold; the specific value is not limited.
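A small sketch of the mean value collection strategy followed by the formula-10 decision; the sigmoid form of q_θ(h) is an assumption consistent with logistic regression, and theta stands for the trained parameter vector.

```python
import numpy as np

def predict_sensitive(h_seq, theta, threshold=0.5):
    """Mean-collect the per-step signals, then apply the formula-10 decision."""
    h = np.mean(h_seq, axis=0)                 # mean value collection strategy
    q = 1.0 / (1.0 + np.exp(-(theta @ h)))     # q_theta(h), assumed sigmoid form
    return 1 if q >= threshold else 0          # z = 1 -> the word segment is sensitive
```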
After S809, the identified sensitive word may be added to a stored sensitive word stock. For example, in one embodiment described above, the general-purpose sensitive word is stored in advance in the form of a dictionary tree, and then, after S809, the identified sensitive word may be added to the dictionary tree.
By applying the embodiment shown in fig. 8, in the first aspect, for the variant of the sensitive word, even if the font or the word pronunciation changes, the context semantic dependency relationship is unchanged, so that the variant of the sensitive word can be identified based on the context semantic dependency relationship in the scheme, and the identification effect is improved. In the second aspect, through preprocessing, variants of some sensitive words are reduced to the sensitive words, so that interference caused by the variants of the sensitive words on recognition is reduced. In the third aspect, general sensitive words in the text are recognized first, and subsequent recognition is only carried out on the text which does not contain the general sensitive words, so that repeated recognition of the recognized sensitive words is avoided, unnecessary calculation amount is reduced, and calculation efficiency is improved.
To verify the effect of the embodiments of the present invention, the following experiment was performed: 10 days were randomly selected within one month, 5000 chat texts were randomly sampled each day, and sensitive word recognition was performed on the sampled texts. Two indexes, accuracy and coverage, were defined in the experiment to measure the recognition effect. Accuracy is defined as the number of correctly identified samples divided by the total number of predicted samples; coverage is defined as the number of correctly identified samples divided by the total number of labeled samples.
Experimental data show that the accuracy of the embodiment of the invention is 90.2% and the coverage is 85.2%, with a standard deviation of 0.02 for accuracy and 0.14 for coverage, indicating stable performance. With a traditional word recognition algorithm, the initial accuracy is 80.1% and the coverage is 75.2%, and both drop noticeably over time because of the lag in word stock updating; the standard deviation of the accuracy is 42.2 and that of the coverage is 55.1. The embodiment of the invention is therefore clearly superior to the traditional algorithm in both recognition accuracy and stability.
Corresponding to the above method embodiment, the embodiment of the present invention further provides a sensitive word recognition device, as shown in fig. 9, including:
An obtaining module 901, configured to obtain a text to be identified;
the segmentation module 902 is configured to perform segmentation processing on the text to be identified to obtain a plurality of segmentation words;
a determining module 903, configured to determine, for each word segment, context information corresponding to each word in the word segment;
a generating module 904, configured to generate a word vector sequence of the segmented word according to the semantic relation in the context information, where the word vector sequence includes a word vector of the segmented word and a word vector of the segmented word in the context information;
the first recognition module 905 is configured to input the word vector sequence of the segmented word into a recognition model obtained by training in advance, so as to obtain a recognition result of whether the segmented word is a sensitive word.
As an embodiment, the apparatus further comprises:
a preprocessing module (not shown in the figure) for carrying out any one or more of the following preprocessing operations on the text to be recognized: character cleaning, full-width to half-width conversion, traditional-to-simplified conversion, pinyin-to-text conversion, split-character merging, and homophone restoration, to obtain a preprocessed text;
the segmentation module 902 is specifically configured to: and carrying out segmentation processing on the preprocessed text to obtain a plurality of segmentation words.
As an embodiment, the apparatus further comprises:
A second recognition module (not shown in the figure) for iteratively intercepting character strings with preset lengths from the text to be recognized; matching each intercepted character string with a pre-established dictionary tree; if there is a branch in the dictionary tree that matches the string, the string is identified as a sensitive word.
As one embodiment, the segmentation module 902 includes: a calculation sub-module, a composition sub-module, and a segmentation sub-module (not shown), wherein,
the computing sub-module is used for computing mutual information between every two adjacent words in the text to be recognized, wherein the mutual information represents the association degree between the adjacent words;
the composition sub-module is used for composing two adjacent words corresponding to each piece of mutual information which is larger than a preset association threshold value into a candidate binary group;
and the segmentation module is used for calculating the information entropy of each candidate binary group, and carrying out segmentation processing on the text to be identified according to the calculated information entropy to obtain a plurality of segmentation words.
As an embodiment, the segmentation submodule includes:
the judging unit is used for judging, for each candidate binary group, whether the information entropy of the candidate binary group is greater than a preset probability threshold; if so, the first determining unit is triggered, and if not, the second determining unit is triggered;
A first determining unit configured to determine the candidate binary group as a word segment;
the second determining unit is used for expanding the candidate binary groups to obtain a multi-group, and judging whether the information entropy of the multi-group is larger than the preset probability threshold value or not; if so, the multi-element group is determined to be a word.
As one embodiment, the information entropy includes left information entropy and right information entropy; the judging unit is specifically configured to: judging whether the left information entropy and the right information entropy of the candidate binary group are both larger than a preset probability threshold value;
the second determining unit is specifically configured to: under the condition that the left information entropy is not greater than the preset probability threshold, expanding the candidate binary group leftwards to obtain a left expanded multi-group, and judging whether the information entropy of the left expanded multi-group is greater than the preset probability threshold; determining the left expanded multi-tuple as a word segmentation under the condition that the information entropy of the left expanded multi-tuple is larger than the preset probability threshold; under the condition that the right information entropy is not greater than the preset probability threshold, expanding the candidate binary group rightward to obtain a right expanded multi-group, and judging whether the information entropy of the right expanded multi-group is greater than the preset probability threshold; and under the condition that the information entropy of the right expanded multi-element group is larger than the preset probability threshold value, determining the right expanded multi-element group as a word segmentation.
As an embodiment, the apparatus further comprises:
a judging module (not shown in the figure) for judging whether the length of the left expanded multi-element group reaches a preset length threshold; if not, executing the step of judging whether the information entropy of the left expanded multi-element group is larger than the preset probability threshold value;
judging whether the length of the right expanded multi-element group reaches a preset length threshold value or not; and if not, executing the step of judging whether the information entropy of the right expanded multi-element group is larger than the preset probability threshold value.
As an embodiment, the determining module 903 is specifically configured to:
for each word in the word segmentation, acquiring stroke information of the word;
performing characteristic numerical processing on the stroke information to obtain a multi-element characteristic sequence of the character;
and inputting the multi-element feature sequence of the word into a mapping model obtained through preset training to obtain the context information corresponding to the word.
As an embodiment, the generating module 904 is specifically configured to:
determining a context semantic dependency of the word, the dependency comprising any one or more of: the above, below, the fragments to which they pertain;
And generating a word vector sequence of the segmented word according to the context semantic dependency relationship of the segmented word by using a long sequence coding algorithm.
As an embodiment, the apparatus further comprises:
a model training module (not shown) for obtaining training samples, the training samples comprising: a word vector sequence consisting of word vectors of a plurality of continuous word segments and classification results corresponding to the plurality of continuous word segments; inputting the training sample into a classification model of a preset structure; recording order information of the word vector sequence by utilizing a sequence memory unit in the classification model; generating a prediction signal based on the sequence by the classification model through a mean value collection strategy; and iteratively adjusting parameters of the sequence memory unit based on the prediction signals and the classification results to obtain a recognition model after training.
By applying the embodiment of the invention, the context information corresponding to each word in the word segmentation is determined; generating a word vector sequence of the word according to the semantic relation in the context information, wherein the word vector sequence comprises the word vector of the word and the word vector of the word in the context information; inputting the word vector sequence of the word into a recognition model which is trained in advance to obtain a recognition result of whether the word is a sensitive word or not; for the variant of the sensitive word, even if the font or the word sound is changed, the context semantic dependency relationship is unchanged, so that the variant of the sensitive word can be identified based on the context semantic dependency relationship in the scheme, and the identification effect is improved.
The embodiment of the invention also provides an electronic device, as shown in fig. 10, comprising a processor 1001 and a memory 1002,
a memory 1002 for storing a computer program;
the processor 1001 is configured to implement any one of the above-described sensitive word recognition methods when executing a program stored in the memory 1002.
The Memory mentioned in the electronic device may include a random access Memory (Random Access Memory, RAM) or may include a Non-Volatile Memory (NVM), such as at least one magnetic disk Memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), etc.; but also digital signal processors (Digital Signal Processing, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
The embodiment of the invention also provides a computer readable storage medium, wherein a computer program is stored in the computer readable storage medium, and the computer program realizes any sensitive word recognition method when being executed by a processor.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for apparatus embodiments, device embodiments, and computer-readable storage medium embodiments, the description is relatively simple, as it is substantially similar to method embodiments, with reference to the portions of the method embodiments that are described herein.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (20)

1. A method for recognizing a sensitive word, comprising:
acquiring a text to be identified;
performing segmentation processing on the text to be identified to obtain a plurality of segmentation words;
for each word segment, determining the context information corresponding to each word in the word segment; generating a word vector sequence of the word according to the semantic relation in the context information, wherein the word vector sequence comprises the word vector of the word and the word vector of the word in the context information;
inputting the word vector sequence of the word into a recognition model which is trained in advance to obtain a recognition result of whether the word is a sensitive word or not;
the determining the context information corresponding to each word in the word segmentation includes:
for each word in the word segmentation, acquiring stroke information of the word;
performing feature numerical processing on the stroke information to obtain an ID corresponding to the stroke information;
combining IDs corresponding to the stroke information in a sliding mode of an N-element window according to the stroke order to obtain a multi-element feature sequence of the character;
And inputting the multi-element feature sequence of the word into a mapping model obtained through preset training to obtain the context information corresponding to the word.
2. The method of claim 1, further comprising, after the obtaining the text to be recognized:
and carrying out any one or more of the following preprocessing operations on the text to be identified: character cleaning, full-width to half-width conversion, traditional-to-simplified conversion, pinyin-to-text conversion, split-character merging, and homophone restoration, to obtain a preprocessed text;
the text to be identified is segmented to obtain a plurality of segmentation words, which comprises the following steps:
and carrying out segmentation processing on the preprocessed text to obtain a plurality of segmentation words.
3. The method of claim 1, further comprising, after the obtaining the text to be recognized:
iteratively intercepting character strings with preset lengths from the text to be identified;
matching each intercepted character string with a pre-established dictionary tree;
if there is a branch in the dictionary tree that matches the string, the string is identified as a sensitive word.
4. The method of claim 1, wherein the performing segmentation on the text to be identified to obtain a plurality of segmentation words includes:
Calculating mutual information between every two adjacent words in the text to be recognized, wherein the mutual information represents the association degree between the adjacent words;
for each piece of mutual information greater than a preset association threshold, forming a candidate binary group by two adjacent words corresponding to the mutual information;
and calculating the information entropy of each candidate binary group, and carrying out segmentation processing on the text to be identified according to the calculated information entropy to obtain a plurality of segmentation words.
5. The method according to claim 4, wherein the performing segmentation processing on the text to be recognized according to the calculated information entropy to obtain a plurality of segmentation words includes:
judging whether the information entropy of each candidate binary group is larger than a preset probability threshold value or not according to each candidate binary group;
if so, determining the candidate binary group as a word segment;
if not, expanding the candidate binary group to obtain a multi-group, and judging whether the information entropy of the multi-group is larger than the preset probability threshold; if so, the multi-element group is determined to be a word.
6. The method of claim 5, wherein the information entropy comprises left information entropy and right information entropy; the judging whether the information entropy of the candidate binary group is larger than a preset probability threshold value comprises the following steps:
Judging whether the left information entropy and the right information entropy of the candidate binary group are both larger than a preset probability threshold value;
expanding the candidate binary group to obtain a multi-group, and judging whether the information entropy of the multi-group is larger than the preset probability threshold or not, wherein the method comprises the following steps:
under the condition that the left information entropy is not greater than the preset probability threshold, expanding the candidate binary group leftwards to obtain a left expanded multi-group, and judging whether the information entropy of the left expanded multi-group is greater than the preset probability threshold;
under the condition that the right information entropy is not greater than the preset probability threshold, expanding the candidate binary group rightward to obtain a right expanded multi-group, and judging whether the information entropy of the right expanded multi-group is greater than the preset probability threshold;
said determining said plurality of groups as a word segment comprising:
determining the left expanded multi-tuple as a word segmentation under the condition that the information entropy of the left expanded multi-tuple is larger than the preset probability threshold;
and under the condition that the information entropy of the right expanded multi-element group is larger than the preset probability threshold value, determining the right expanded multi-element group as a word segmentation.
7. The method of claim 6, further comprising, after said obtaining the left expanded multi-tuple:
judging whether the length of the left expanded multi-element group reaches a preset length threshold value or not;
if not, executing the step of judging whether the information entropy of the left expanded multi-element group is larger than the preset probability threshold value;
after the right expanded multi-tuple is obtained, the method further comprises:
judging whether the length of the right expanded multi-element group reaches a preset length threshold value or not;
and if not, executing the step of judging whether the information entropy of the right expanded multi-element group is larger than the preset probability threshold value.
8. The method of claim 1, wherein generating the word vector sequence of the segmented word based on the semantic relationship in the context information comprises:
determining a context semantic dependency of the word, the dependency comprising any one or more of: the above, below, the fragments to which they pertain;
and generating a word vector sequence of the segmented word according to the context semantic dependency relationship of the segmented word by using a long sequence coding algorithm.
9. The method according to claim 1, wherein the recognition model is trained by:
Obtaining a training sample, the training sample comprising: a word vector sequence consisting of word vectors of a plurality of continuous word segments and classification results corresponding to the plurality of continuous word segments;
inputting the training sample into a classification model of a preset structure;
recording order information of the word vector sequence by utilizing a sequence memory unit in the classification model;
generating a prediction signal based on the sequence by the classification model through a mean value collection strategy;
and iteratively adjusting parameters of the sequence memory unit based on the prediction signals and the classification results to obtain a recognition model after training.
10. A sensitive word recognition apparatus, comprising:
the acquisition module is used for acquiring the text to be identified;
the segmentation module is used for carrying out segmentation processing on the text to be identified to obtain a plurality of segmentation words;
the determining module is used for determining context information corresponding to each word in each word segmentation aiming at each word segmentation;
the generation module is used for generating a word vector sequence of the word according to the semantic relation in the context information, wherein the word vector sequence comprises the word vector of the word and the word vector of the word in the context information;
The first recognition module is used for inputting the word vector sequence of the segmented word into a recognition model which is obtained through training in advance to obtain a recognition result of whether the segmented word is a sensitive word or not;
the determining module is specifically configured to:
for each word in the word segmentation, acquiring stroke information of the word;
performing feature numerical processing on the stroke information to obtain an ID corresponding to the stroke information;
combining IDs corresponding to the stroke information in a sliding mode of an N-element window according to the stroke order to obtain a multi-element feature sequence of the character;
and inputting the multi-element feature sequence of the word into a mapping model obtained through preset training to obtain the context information corresponding to the word.
11. The apparatus of claim 10, wherein the apparatus further comprises:
the preprocessing module is used for carrying out any one or more of the following preprocessing operations on the text to be recognized: character cleaning, full-width to half-width conversion, traditional-to-simplified conversion, pinyin-to-text conversion, split-character merging, and homophone restoration, to obtain a preprocessed text;
the segmentation module is specifically configured to: and carrying out segmentation processing on the preprocessed text to obtain a plurality of segmentation words.
12. The apparatus of claim 10, wherein the apparatus further comprises:
The second recognition module is used for iteratively intercepting character strings with preset lengths from the text to be recognized; matching each intercepted character string with a pre-established dictionary tree; if there is a branch in the dictionary tree that matches the string, the string is identified as a sensitive word.
13. The apparatus of claim 10, wherein the segmentation module comprises:
the computing sub-module is used for computing mutual information between every two adjacent words in the text to be recognized, wherein the mutual information represents the association degree between the adjacent words;
the composition sub-module is used for composing two adjacent words corresponding to each piece of mutual information which is larger than a preset association threshold value into a candidate binary group;
and the segmentation module is used for calculating the information entropy of each candidate binary group, and carrying out segmentation processing on the text to be identified according to the calculated information entropy to obtain a plurality of segmentation words.
14. The apparatus of claim 13, wherein the segmentation submodule comprises:
the judging unit is used for judging, for each candidate binary group, whether the information entropy of the candidate binary group is greater than a preset probability threshold; if so, the first determining unit is triggered, and if not, the second determining unit is triggered;
A first determining unit configured to determine the candidate binary group as a word segment;
the second determining unit is used for expanding the candidate binary groups to obtain a multi-group, and judging whether the information entropy of the multi-group is larger than the preset probability threshold value or not; if so, the multi-element group is determined to be a word.
15. The apparatus of claim 14, wherein the information entropy comprises left information entropy and right information entropy; the judging unit is specifically configured to: judging whether the left information entropy and the right information entropy of the candidate binary group are both larger than a preset probability threshold value;
the second determining unit is specifically configured to: under the condition that the left information entropy is not greater than the preset probability threshold, expanding the candidate binary group leftwards to obtain a left expanded multi-group, and judging whether the information entropy of the left expanded multi-group is greater than the preset probability threshold; determining the left expanded multi-tuple as a word segmentation under the condition that the information entropy of the left expanded multi-tuple is larger than the preset probability threshold; under the condition that the right information entropy is not greater than the preset probability threshold, expanding the candidate binary group rightward to obtain a right expanded multi-group, and judging whether the information entropy of the right expanded multi-group is greater than the preset probability threshold; and under the condition that the information entropy of the right expanded multi-element group is larger than the preset probability threshold value, determining the right expanded multi-element group as a word segmentation.
16. The apparatus of claim 15, wherein the apparatus further comprises:
the judging module is used for judging whether the length of the left expanded multi-element group reaches a preset length threshold value or not; if not, executing the step of judging whether the information entropy of the left expanded multi-element group is larger than the preset probability threshold value;
judging whether the length of the right expanded multi-element group reaches a preset length threshold value or not; and if not, executing the step of judging whether the information entropy of the right expanded multi-element group is larger than the preset probability threshold value.
17. The apparatus according to claim 10, wherein the generating module is specifically configured to:
determining a context semantic dependency of the word, the dependency comprising any one or more of: the above, below, the fragments to which they pertain;
and generating a word vector sequence of the segmented word according to the context semantic dependency relationship of the segmented word by using a long sequence coding algorithm.
18. The apparatus of claim 10, wherein the apparatus further comprises:
the model training module is used for obtaining training samples, and the training samples comprise: a word vector sequence consisting of word vectors of a plurality of continuous word segments and classification results corresponding to the plurality of continuous word segments; inputting the training sample into a classification model of a preset structure; recording order information of the word vector sequence by utilizing a sequence memory unit in the classification model; generating a prediction signal based on the sequence by the classification model through a mean value collection strategy; and iteratively adjusting parameters of the sequence memory unit based on the prediction signals and the classification results to obtain a recognition model after training.
19. An electronic device comprising a processor and a memory;
a memory for storing a computer program;
a processor for carrying out the method steps of any one of claims 1-9 when executing the program stored on the memory.
20. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored therein a computer program which, when executed by a processor, implements the method steps of any of claims 1-9.
CN201811603465.0A 2018-12-26 2018-12-26 Sensitive word recognition method, device and equipment Active CN111368535B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811603465.0A CN111368535B (en) 2018-12-26 2018-12-26 Sensitive word recognition method, device and equipment

Publications (2)

Publication Number Publication Date
CN111368535A CN111368535A (en) 2020-07-03
CN111368535B true CN111368535B (en) 2024-01-16

Family

ID=71206104

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811603465.0A Active CN111368535B (en) 2018-12-26 2018-12-26 Sensitive word recognition method, device and equipment

Country Status (1)

Country Link
CN (1) CN111368535B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112036167B (en) * 2020-08-25 2023-11-28 腾讯科技(深圳)有限公司 Data processing method, device, server and storage medium
CN111814192B (en) * 2020-08-28 2021-04-27 支付宝(杭州)信息技术有限公司 Training sample generation method and device and sensitive information detection method and device
CN112084306B (en) * 2020-09-10 2023-08-29 北京天融信网络安全技术有限公司 Keyword mining method and device, storage medium and electronic equipment
CN112287684B (en) * 2020-10-30 2024-06-11 中国科学院自动化研究所 Short text auditing method and device for fusion variant word recognition
CN112364216A (en) * 2020-11-23 2021-02-12 上海竞信网络科技有限公司 Edge node content auditing and filtering system and method
CN112612894B (en) * 2020-12-29 2022-03-18 平安科技(深圳)有限公司 Method and device for training intention recognition model, computer equipment and storage medium
CN112906380B (en) * 2021-02-02 2024-09-27 北京有竹居网络技术有限公司 Character recognition method and device in text, readable medium and electronic equipment
CN112801425B (en) * 2021-03-31 2021-07-02 腾讯科技(深圳)有限公司 Method and device for determining information click rate, computer equipment and storage medium
CN113076748B (en) * 2021-04-16 2024-01-19 平安国际智慧城市科技股份有限公司 Bullet screen sensitive word processing method, device, equipment and storage medium
CN113033217B (en) * 2021-04-19 2023-09-15 广州欢网科技有限责任公司 Automatic shielding translation method and device for subtitle sensitive information
CN113128241A (en) * 2021-05-17 2021-07-16 口碑(上海)信息技术有限公司 Text recognition method, device and equipment
CN113449510B (en) * 2021-06-28 2022-12-27 平安科技(深圳)有限公司 Text recognition method, device, equipment and storage medium
CN113407658B (en) * 2021-07-06 2021-12-21 北京容联七陌科技有限公司 Method and system for filtering and replacing text content sensitive words in online customer service scene
CN113536782B (en) * 2021-07-09 2023-12-26 平安国际智慧城市科技股份有限公司 Sensitive word recognition method and device, electronic equipment and storage medium
CN113723096A (en) * 2021-07-23 2021-11-30 智慧芽信息科技(苏州)有限公司 Text recognition method and device, computer-readable storage medium and electronic equipment
CN114117149B (en) * 2021-11-25 2024-08-02 深圳前海微众银行股份有限公司 Sensitive word filtering method and device and storage medium
CN115238044A (en) * 2022-09-21 2022-10-25 广州市千钧网络科技有限公司 Sensitive word detection method, device and equipment and readable storage medium
CN117009533B (en) * 2023-09-27 2023-12-26 戎行技术有限公司 Dark language identification method based on classification extraction and word vector model
CN117077678B (en) * 2023-10-13 2023-12-29 河北神玥软件科技股份有限公司 Sensitive word recognition method, device, equipment and medium
CN117435692A (en) * 2023-11-02 2024-01-23 北京云上曲率科技有限公司 Variant-based antagonism sensitive text recognition method and system
CN117493540A (en) * 2023-12-28 2024-02-02 荣耀终端有限公司 Text matching method, terminal device and computer readable storage medium
CN117592473B (en) * 2024-01-18 2024-04-09 武汉杏仁桉科技有限公司 Harmonic splitting processing method and device for multiple Chinese phrases
CN118468858B (en) * 2024-07-11 2024-09-17 厦门众联世纪股份有限公司 Advertisement corpus contraband word processing method and system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104850819A (en) * 2014-02-18 2015-08-19 联想(北京)有限公司 Information processing method and electronic device
WO2017185674A1 (en) * 2016-04-29 2017-11-02 乐视控股(北京)有限公司 Method and apparatus for discovering new word
CN108734159A (en) * 2017-04-18 2018-11-02 苏宁云商集团股份有限公司 The detection method and system of sensitive information in a kind of image
CN107168952A (en) * 2017-05-15 2017-09-15 北京百度网讯科技有限公司 Information generating method and device based on artificial intelligence
CN108021558A (en) * 2017-12-27 2018-05-11 北京金山安全软件有限公司 Keyword recognition method and device, electronic equipment and storage medium
CN108984530A (en) * 2018-07-23 2018-12-11 北京信息科技大学 A kind of detection method and detection system of network sensitive content

Also Published As

Publication number Publication date
CN111368535A (en) 2020-07-03

Similar Documents

Publication Publication Date Title
CN111368535B (en) Sensitive word recognition method, device and equipment
De Mulder et al. A survey on the application of recurrent neural networks to statistical language modeling
Chien et al. Bayesian recurrent neural network for language modeling
CN111914067B (en) Chinese text matching method and system
CN110263325B (en) Chinese word segmentation system
Wang et al. Sequence modeling via segmentations
CN110619034A (en) Text keyword generation method based on Transformer model
CN112487190B (en) Method for extracting relationships between entities from text based on self-supervision and clustering technology
CN109376222A (en) Question and answer matching degree calculation method, question and answer automatic matching method and device
Lan et al. Three convolutional neural network-based models for learning sentiment word vectors towards sentiment analysis
CN109033084B (en) Semantic hierarchical tree construction method and device
Grzegorczyk Vector representations of text data in deep learning
Noaman et al. Enhancing recurrent neural network-based language models by word tokenization
Yang et al. Sequence-to-sequence prediction of personal computer software by recurrent neural network
Shi A study on neural network language modeling
CN114490954A (en) Document level generation type event extraction method based on task adjustment
De Meulemeester et al. Unsupervised embeddings for categorical variables
Lindén et al. Evaluating combinations of classification algorithms and paragraph vectors for news article classification
CN112835798A (en) Cluster learning method, test step clustering method and related device
Majewski et al. Sentence recognition using artificial neural networks
Sun et al. Chinese microblog sentiment classification based on convolution neural network with content extension method
Shuang et al. Combining word order and cnn-lstm for sentence sentiment classification
Zhang Comparing the Effect of Smoothing and N-gram Order: Finding the Best Way to Combine the Smoothing and Order of N-gram
Jiang et al. Exploration of Tree-based Hierarchical Softmax for Recurrent Language Models.
Kibria et al. Context-driven bengali text generation using conditional language model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 519000 Room 102, 202, 302 and 402, No. 325, Qiandao Ring Road, Tangjiawan Town, high tech Zone, Zhuhai City, Guangdong Province, Room 102 and 202, No. 327 and Room 302, No. 329
Applicant after: Zhuhai Jinshan Digital Network Technology Co.,Ltd.
Address before: 519080 Room 102, No. 325, Qiandao Ring Road, Tangjiawan Town, high tech Zone, Zhuhai City, Guangdong Province
Applicant before: ZHUHAI KINGSOFT ONLINE GAME TECHNOLOGY Co.,Ltd.
GR01 Patent grant