CN111368535A - Sensitive word recognition method, device and equipment - Google Patents
- Publication number
- CN111368535A (application CN201811603465.0A)
- Authority
- CN
- China
- Prior art keywords
- word
- participle
- information entropy
- text
- information
- Prior art date
- Legal status
- Granted
Abstract
Embodiments of the invention provide a sensitive word recognition method, device and equipment. The method comprises: determining the context information corresponding to each character in a participle; generating a word vector sequence of the participle according to the semantic relations in the context information, the word vector sequence comprising the word vector of the participle and the word vectors of the words in the context information; and inputting the word vector sequence of the participle into a pre-trained recognition model to obtain a recognition result indicating whether the participle is a sensitive word. For variants of sensitive words, the contextual semantic dependencies remain unchanged even if the glyph or pronunciation of the word is changed; therefore, by recognizing sensitive words on the basis of contextual semantic dependencies, this scheme can also identify variants of sensitive words, which improves the recognition effect.
Description
Technical Field
The invention relates to the technical field of word processing, in particular to a sensitive word recognition method, a sensitive word recognition device and sensitive word recognition equipment.
Background
In some internet scenarios, such as internet forums, personal home pages, game chats, etc., the user may publish some text contents to express opinions, express moods, or communicate with other users. In order to create a healthy network environment, it is usually necessary to check the text published by the user, i.e. to identify whether the text contains sensitive words that do not meet the specification.
Existing sensitive word recognition schemes typically work as follows: obtain the text content published by a user, segment the text content into a plurality of participles, and match each participle against a pre-established sensitive word lexicon; if the match succeeds, the participle is a sensitive word.
At present, many variants of sensitive words exist that resemble the original words in glyph or pronunciation. The above scheme cannot identify such variants, so its recognition effect is poor.
Disclosure of Invention
The embodiment of the invention aims to provide a sensitive word recognition method, a sensitive word recognition device and sensitive word recognition equipment so as to improve the recognition effect.
In order to achieve the above object, an embodiment of the present invention provides a sensitive word recognition method, including:
acquiring a text to be identified;
segmenting the text to be recognized to obtain a plurality of word segments;
for each participle, determining the context information corresponding to each character in the participle, and generating a word vector sequence of the participle according to the semantic relations in the context information, wherein the word vector sequence comprises the word vector of the participle and the word vectors of the words in the context information;
and inputting the word vector sequence of the participle into a recognition model obtained by pre-training to obtain a recognition result of whether the participle is a sensitive word.
Optionally, after the obtaining the text to be recognized, the method further includes:
and performing any one or more of the following preprocessing on the text to be recognized: cleaning characters, turning full angles to half angles, turning traditional Chinese characters into simplified Chinese characters, turning pinyin into characters, combining split characters and restoring harmonic characters to obtain a preprocessed text;
the segmenting processing is carried out on the text to be recognized to obtain a plurality of word segments, and the method comprises the following steps:
and carrying out segmentation processing on the preprocessed text to obtain a plurality of word segments.
Optionally, after the obtaining the text to be recognized, the method further includes:
iteratively intercepting character strings with preset lengths from the text to be recognized;
aiming at each intercepted character string, matching the character string with a pre-established dictionary tree;
if there is a branch in the trie that matches the character string, the character string is identified as a sensitive word.
Optionally, the segmenting the text to be recognized to obtain a plurality of word segments includes:
calculating mutual information between every two adjacent words in the text to be recognized, wherein the mutual information represents the correlation degree between the adjacent words;
aiming at each piece of mutual information which is larger than a preset correlation threshold value, forming a candidate binary group by two adjacent words corresponding to the mutual information;
and calculating the information entropy of each candidate binary group, and segmenting the text to be recognized according to the calculated information entropy to obtain a plurality of word segments.
Optionally, the segmenting the text to be recognized according to the information entropy obtained by calculation to obtain a plurality of word segments includes:
judging whether the information entropy of each candidate binary group is greater than a preset probability threshold or not according to each candidate binary group;
if so, determining the candidate binary group as a participle;
if not, expanding the candidate binary group to obtain a multi-element group, and judging whether the information entropy of the multi-element group is greater than the preset probability threshold value or not; and if so, determining the multi-element group as a participle.
Optionally, the information entropy includes a left information entropy and a right information entropy; the judging whether the information entropy of the candidate binary group is greater than a preset probability threshold includes:
judging whether the left information entropy and the right information entropy of the candidate binary group are both larger than a preset probability threshold value;
the expanding the candidate binary group to obtain a multi-tuple, and judging whether the information entropy of the multi-tuple is greater than the preset probability threshold value or not comprises the following steps:
under the condition that the left information entropy is not larger than the preset probability threshold, expanding the candidate binary group to the left to obtain a left expanded multi-tuple, and judging whether the information entropy of the left expanded multi-tuple is larger than the preset probability threshold or not;
under the condition that the right information entropy is not larger than the preset probability threshold, expanding the candidate binary group to the right to obtain a right expanded multi-tuple, and judging whether the information entropy of the right expanded multi-tuple is larger than the preset probability threshold or not;
the determining the multi-element group as a participle comprises:
determining the left-extended multi-element group as a word segmentation under the condition that the information entropy of the left-extended multi-element group is larger than the preset probability threshold;
and determining the right expanded multi-element group as a word segmentation under the condition that the information entropy of the right expanded multi-element group is greater than the preset probability threshold.
Optionally, after obtaining the left extended tuple, the method further includes:
judging whether the length of the left expanded tuple reaches a preset length threshold value or not;
if not, executing the step of judging whether the information entropy of the left expanded multi-element group is larger than the preset probability threshold value;
after the obtaining of the right extended tuple, further comprising:
judging whether the length of the right expanded tuple reaches a preset length threshold value or not;
and if not, executing the step of judging whether the information entropy of the right expanded multi-element group is larger than the preset probability threshold value.
Optionally, the determining the context information corresponding to each word in the participle includes:
acquiring stroke information of each character in the participle aiming at each character;
performing characteristic numeralization processing on the stroke information to obtain a multi-element characteristic sequence of the character;
and inputting the multi-element characteristic sequence of the character into a mapping model obtained by preset training to obtain context information corresponding to the character.
Optionally, the generating a word vector sequence of the participle according to the semantic relationship in the context information includes:
determining contextual semantic dependencies of the participle, the dependencies including any one or more of: the preceding text, the following text, and fragments thereof;
and generating a word vector sequence of the participle according to the context semantic dependency relationship of the participle by using a long sequence coding algorithm.
Optionally, the recognition model is obtained by training the following steps:
obtaining training samples, the training samples comprising: the word vector sequence is composed of word vectors of a plurality of continuous word segments, and classification results corresponding to the continuous word segments;
inputting the training sample into a classification model with a preset structure;
recording the order information of the word vector sequence by using a sequence memory unit in the classification model;
generating a sequence-based prediction signal by the classification model through a mean aggregation strategy;
and iteratively adjusting the parameters of the sequence memory unit based on the prediction signal and the classification result to obtain a trained recognition model.
In order to achieve the above object, an embodiment of the present invention further provides a sensitive word recognition apparatus, including:
the acquisition module is used for acquiring a text to be recognized;
the segmentation module is used for carrying out segmentation processing on the text to be recognized to obtain a plurality of segmented words;
the determining module is used for determining context information corresponding to each word in each participle;
a generating module, configured to generate a word vector sequence of the participle according to a semantic relationship in the context information, where the word vector sequence includes a word vector of the participle and a word vector of the participle in the context information;
and the first recognition module is used for inputting the word vector sequence of the participle into a recognition model obtained by pre-training to obtain a recognition result of whether the participle is a sensitive word.
Optionally, the apparatus further comprises:
the preprocessing module is used for preprocessing the text to be recognized, wherein the preprocessing module is used for preprocessing any one or more of the following texts: cleaning characters, turning full angles to half angles, turning traditional Chinese characters into simplified Chinese characters, turning pinyin into characters, combining split characters and restoring harmonic characters to obtain a preprocessed text;
the cutting module is specifically configured to: and carrying out segmentation processing on the preprocessed text to obtain a plurality of word segments.
Optionally, the apparatus further comprises:
the second identification module is used for iteratively intercepting character strings with preset lengths from the text to be identified; aiming at each intercepted character string, matching the character string with a pre-established dictionary tree; if there is a branch in the trie that matches the character string, the character string is identified as a sensitive word.
Optionally, the dividing module includes:
the calculation submodule is used for calculating mutual information between every two adjacent words in the text to be recognized, and the mutual information represents the association degree between the adjacent words;
the composition submodule is used for forming a candidate binary group by two adjacent words corresponding to each piece of mutual information which is larger than a preset association threshold value;
and the segmentation sub-module is used for calculating the information entropy of each candidate binary group, and segmenting the text to be recognized according to the calculated information entropy to obtain a plurality of word segments.
Optionally, the splitting sub-module includes:
the judging unit is used for judging, for each candidate binary group, whether the information entropy of the candidate binary group is greater than a preset probability threshold; if it is greater, the first determining unit is triggered, and if it is not greater, the second determining unit is triggered;
a first determining unit, configured to determine the candidate binary group as a word segmentation;
the second determining unit is used for expanding the candidate binary group to obtain a multi-element group and judging whether the information entropy of the multi-element group is greater than the preset probability threshold value or not; and if so, determining the multi-element group as a participle.
Optionally, the information entropy includes a left information entropy and a right information entropy; the judging unit is specifically configured to: judging whether the left information entropy and the right information entropy of the candidate binary group are both larger than a preset probability threshold value;
the second determining unit is specifically configured to: under the condition that the left information entropy is not larger than the preset probability threshold, expanding the candidate binary group to the left to obtain a left expanded multi-tuple, and judging whether the information entropy of the left expanded multi-tuple is larger than the preset probability threshold or not; determining the left-extended multi-element group as a word segmentation under the condition that the information entropy of the left-extended multi-element group is larger than the preset probability threshold; under the condition that the right information entropy is not larger than the preset probability threshold, expanding the candidate binary group to the right to obtain a right expanded multi-tuple, and judging whether the information entropy of the right expanded multi-tuple is larger than the preset probability threshold or not; and determining the right expanded multi-element group as a word segmentation under the condition that the information entropy of the right expanded multi-element group is greater than the preset probability threshold.
Optionally, the apparatus further comprises:
the judging module is used for judging whether the length of the left expanded multi-tuple reaches a preset length threshold value or not; if not, executing the step of judging whether the information entropy of the left expanded multi-element group is larger than the preset probability threshold value;
judging whether the length of the right expanded tuple reaches a preset length threshold value or not; and if not, executing the step of judging whether the information entropy of the right expanded multi-element group is larger than the preset probability threshold value.
Optionally, the determining module is specifically configured to:
acquiring stroke information of each character in the participle aiming at each character;
performing characteristic numeralization processing on the stroke information to obtain a multi-element characteristic sequence of the character;
and inputting the multi-element characteristic sequence of the character into a mapping model obtained by preset training to obtain context information corresponding to the character.
Optionally, the generating module is specifically configured to:
determining contextual semantic dependencies of the participle, the dependencies including any one or more of: the preceding text, the following text, and fragments thereof;
and generating a word vector sequence of the participle according to the context semantic dependency relationship of the participle by using a long sequence coding algorithm.
Optionally, the apparatus further comprises:
a model training module for obtaining training samples, the training samples comprising: the word vector sequence is composed of word vectors of a plurality of continuous word segments, and classification results corresponding to the continuous word segments; inputting the training sample into a classification model with a preset structure; recording the order information of the word vector sequence by using a sequence memory unit in the classification model; generating a sequence-based prediction signal by the classification model through a mean aggregation strategy; and iteratively adjusting the parameters of the sequence memory unit based on the prediction signal and the classification result to obtain a trained recognition model.
In order to achieve the above object, an embodiment of the present invention further provides an electronic device, including a processor and a memory;
a memory for storing a computer program;
and the processor is used for realizing any one of the sensitive word recognition methods when executing the program stored in the memory.
In order to achieve the above object, an embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and the computer program, when executed by a processor, implements any one of the above sensitive word recognition methods.
When the embodiments of the invention are applied to sensitive word recognition, the context information corresponding to each character in a participle is determined; a word vector sequence of the participle is generated according to the semantic relations in the context information, the word vector sequence comprising the word vector of the participle and the word vectors of the words in the context information; and the word vector sequence of the participle is input into a pre-trained recognition model to obtain a recognition result indicating whether the participle is a sensitive word. For variants of sensitive words, the contextual semantic dependencies remain unchanged even if the glyph or pronunciation of the word is changed; therefore, by recognizing sensitive words on the basis of contextual semantic dependencies, this scheme can also identify variants of sensitive words, which improves the recognition effect.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a first flowchart of a sensitive word recognition method according to an embodiment of the present invention;
FIG. 2 is a diagram of a dictionary tree according to an embodiment of the present invention;
fig. 3 is a schematic flow chart of word segmentation determination according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of another process for determining word segmentation according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an embedded neural network according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating a classification model according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a recognition model provided by an embodiment of the present invention;
fig. 8 is a schematic flowchart of a sensitive word recognition method according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a sensitive word recognition apparatus according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to solve the above technical problems, embodiments of the present invention provide a method, an apparatus, and a device for recognizing a sensitive word, where the method and the apparatus may be applied to various electronic devices, and are not limited specifically. First, the sensitive word recognition method provided by the embodiment of the present invention is described in detail below. For convenience of description, the execution main body is referred to as an electronic apparatus in the following embodiments.
Fig. 1 is a first flowchart of a sensitive word recognition method according to an embodiment of the present invention, including:
s101: and acquiring a text to be recognized.
For example, the text to be recognized may be the text content published by the user in various scenes such as a web forum, a personal homepage, a game chat, and the like. The text to be recognized can be a Chinese text, an English text, a Japanese text and the like, and the specific language type is not limited.
S102: and carrying out segmentation processing on the text to be recognized to obtain a plurality of word segments.
As an embodiment, after S101, the text to be recognized may be preprocessed by any one or more of the following processes: cleaning characters, turning full angles to half angles, turning traditional Chinese characters into simplified Chinese characters, turning pinyin into characters, combining split characters and restoring harmonic characters to obtain a preprocessed text; thus, S102 is: and carrying out segmentation processing on the preprocessed text to obtain a plurality of word segments.
For example, if the text to be recognized is Chinese text, character cleaning may remove non-Chinese characters such as English letters, chat emoticons, and symbols like "#-.¥%@".
Full-width to half-width: full-width characters are uniformly converted into half-width characters, for example the full-width "ＡＢＣ１２３" is converted into the half-width "ABC123".
Traditional to simplified: traditional Chinese characters are converted into simplified characters, for example the traditional form "一萬年" is converted into the simplified "一万年" ("ten thousand years").
Pinyin to characters: pinyin letters are converted into the corresponding Chinese characters, for example "yiwannian" is converted into "一万年" ("ten thousand years").
Merging split characters: if the text to be recognized is Chinese text, characters that have been split into their components are merged back into the normal Chinese character, for example "车仑" is merged into "轮" ("wheel").
Homophone restoration: homophonic (harmonic) characters are restored to the normal Chinese characters, for example a homophonic variant of "处理器" ("processor") is restored to "处理器".
If several of the above preprocessing modes are applied, the order in which they are applied is not limited.
For example, some users insert non-Chinese characters, English letters, chat emoticons, or symbols such as "#-.¥%@" between the characters of a sensitive word, or write the sensitive word in full-width characters, traditional characters, pinyin, split characters, or homophones; the preprocessing above is intended to undo such disguises so that the sensitive words can be recognized.
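A minimal sketch of two of these preprocessing steps follows (character cleaning and full-width-to-half-width conversion); the remaining steps (traditional-to-simplified, pinyin-to-characters, split-character merging, homophone restoration) typically rely on mapping tables or external dictionaries and are omitted here. The regular expression and the use of NFKC normalization are illustrative choices, not taken from the patent.

```python
import re
import unicodedata

def preprocess(text: str) -> str:
    # Full-width to half-width: NFKC normalization folds full-width ASCII
    # forms (e.g. "ＡＢＣ１２３") into their half-width equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Character cleaning: keep CJK ideographs, letters and digits; drop
    # emoticons and symbols such as "#-.¥%@".
    return re.sub(r"[^\u4e00-\u9fffA-Za-z0-9]", "", text)

print(preprocess("敏ＡＢＣ＃感１２３词"))  # -> "敏ABC感123词"
```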
As an embodiment, after S101, iteratively intercepting a character string of a preset length from the text to be recognized; aiming at each intercepted character string, matching the character string with a pre-established dictionary tree; if there is a branch in the trie that matches the character string, the character string is identified as a sensitive word.
For example, the dictionary tree may include a root node and a plurality of leaf nodes as shown in fig. 2, where the root node may be empty, or the root node does not include a character, and the leaf nodes include a character, and the characters from the root node to the end leaf node are connected into a character string, where the character string is a sensitive word. The branches of the dictionary tree from the root node to the end leaf node are understood as one branch corresponding to one character string or one branch corresponding to one sensitive word. The character strings, the dictionary tree structure, and the like in fig. 2 are merely examples, and the dictionary tree is not limited.
Character strings of the preset length are iteratively intercepted from the text to be recognized, where the preset length may be the maximum length of the sensitive words in the dictionary tree. For example, assume the preset length is 5 characters and the text to be recognized is "we are good friends"; the character strings obtained by iterative interception are then the successive 5-character substrings of the text (a window slid one character at a time). This is only an example; the text to be recognized is usually a longer text.
The matching process is similar for each intercepted character string; one character string is taken below as an example:
the matching process is a process of traversing the dictionary tree, wherein traversing starts from a root node, and sequentially matches each character in the character string with each leaf node in each branch of the dictionary tree according to the direction (the direction from the root node to a leaf node at the tail end) in the dictionary tree; if the character in the character string is successfully matched with each leaf node in the branch, the branch is matched with the character string.
Specifically, if the number of characters in the character string is greater than the number of leaf nodes in the branch, for example, the character string is "we are good" and one branch of the dictionary tree is "we", then the first two characters of the character string match with both leaf nodes of the branch successfully, which in this case indicates that the branch matches with the character string. If a string matches a branch in the dictionary tree, the string is a sensitive word.
It is to be understood that, in the present embodiment, some generic sensitive words are stored in advance, and these generic sensitive words are stored in the form of a dictionary tree. The dictionary tree structure uses a common prefix (an empty root node), so that the expense of query time can be reduced, and the matching efficiency is improved. In this embodiment, the character strings in the text to be recognized are matched with the dictionary tree, that is, the general sensitive words in the text to be recognized are recognized first, and then the sensitive words in the remaining content are recognized by using the schemes of S102 to S104.
If the above one embodiment is adopted, the text to be recognized is subjected to any one or more of the above pretreatments to obtain the preprocessed text, then the embodiment may be adopted to iteratively intercept the character strings with the preset length from the preprocessed text, and match the character strings with the pre-established dictionary tree. The specific matching process is similar and is not described in detail.
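A small sketch of the dictionary-tree matching described above follows; the class and function names are illustrative, and the window length corresponds to the preset length (the maximum sensitive-word length) mentioned earlier.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end = False   # marks the end leaf node of a branch

def build_trie(words):
    root = TrieNode()       # empty root node, shared prefix of all branches
    for w in words:
        node = root
        for ch in w:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root

def match_sensitive(text, root, max_len):
    # Slide a window of the preset length over the text; any prefix of the
    # window that matches a complete branch of the trie is a sensitive word.
    hits = set()
    for i in range(len(text)):
        node = root
        for j, ch in enumerate(text[i:i + max_len]):
            node = node.children.get(ch)
            if node is None:
                break
            if node.is_end:
                hits.add(text[i:i + j + 1])
    return hits
```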
As an embodiment, S102 may include: calculating mutual information between every two adjacent words in the text to be recognized, wherein the mutual information represents the correlation degree between the adjacent words; aiming at each piece of mutual information which is larger than a preset correlation threshold value, forming a candidate binary group by two adjacent words corresponding to the mutual information; and calculating the information entropy of each candidate binary group, and segmenting the text to be recognized according to the calculated information entropy to obtain a plurality of word segments.
For example, the mutual information between every two adjacent words in the text to be recognized can be calculated by using the following formula:
PMI = p(x, y) / (p(x) · p(y)) = cnt(x, y) / (cnt(x) · cnt(y))   (Formula 1)
PMI represents mutual information between adjacent words, the larger PMI represents the stronger the relation degree between adjacent words, p (x) and p (y) represent the probability of occurrence of an event x and an event y respectively, p (x, y) represents the probability of simultaneous occurrence of the event x and the event y, and cnt represents a function of statistical frequency.
An association threshold may be preset, and if mutual information between two adjacent words is greater than the association threshold, which indicates that the association degree of the two adjacent words is strong, the two adjacent words are combined into a candidate binary group. The association threshold may be set according to actual conditions, for example, may be 1, and the specific value is not limited. It can be understood that the relevance degree between each character in the sensitive words is strong, a candidate binary group with strong relevance degree can be screened out from the text to be recognized, and the binary group consists of two characters.
In this embodiment, the information entropy of each candidate binary group is calculated, and the information entropy represents the random degree of the left adjacent character set and the right adjacent character set of one text segment. For a word, the more collocated words between the word and the words adjacent to the left and right of the word, the higher the probability that the word and the words adjacent to the left and right of the word belong to different words. For example, the information entropy of the candidate doublet may be calculated using the following equation:
Entropy(U) = −Σ_{i=1}^{n} p_i · log(p_i)   (Formula 2)
wherein Entropy denotes the information entropy, U denotes the candidate binary group, i identifies an adjacent word, p_i denotes the probability of occurrence of the adjacent word identified by i, and n denotes the number of adjacent words.
For example, take the tongue-twister sentence "eat grapes without spitting out the grape skins, don't eat grapes yet spit out the grape skins": the word "grape" appears four times, its left adjacent characters are {eat, spit, eat, spit}, and its right adjacent characters are {not, skin, instead, skin}. Calculated according to Formula 2, the information entropy of the left adjacent characters of the word "grape" is −(1/2)·log(1/2) − (1/2)·log(1/2), and the information entropy of its right adjacent characters is −(1/2)·log(1/2) − (1/4)·log(1/4) − (1/4)·log(1/4).
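The following sketch shows how the left/right information entropy of a text fragment and the count-based mutual information of Formula 1 could be computed; the helper names are illustrative, and the corpus-size normalization is elided as in Formula 1.

```python
import math
from collections import Counter

def left_right_entropy(text, fragment):
    # Collect the characters immediately to the left / right of every
    # occurrence of `fragment` in `text` (cf. the "grape" example above).
    left, right = Counter(), Counter()
    start = text.find(fragment)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(fragment)
        if end < len(text):
            right[text[end]] += 1
        start = text.find(fragment, start + 1)

    def entropy(counter):
        total = sum(counter.values())
        return -sum((c / total) * math.log(c / total) for c in counter.values()) if total else 0.0

    return entropy(left), entropy(right)

def pmi(text, x, y):
    # Formula 1 as written above: cnt(x, y) / (cnt(x) * cnt(y)) for adjacent
    # characters x and y; assumes both characters occur in the text.
    return text.count(x + y) / (text.count(x) * text.count(y))
```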
As an embodiment, for each candidate binary group it may be judged whether the information entropy of the candidate binary group is greater than a preset probability threshold: if so, the candidate binary group is determined to be a participle; if not, either no further steps are performed, or the candidate binary group may be expanded into a multi-tuple and it is judged whether the information entropy of the multi-tuple is greater than the preset probability threshold; if so, the multi-tuple is determined to be a participle.
The probability threshold value can be set according to actual conditions, and specific numerical values are not limited.
In one case, a length threshold may be set, where the length threshold represents a maximum length of a word segmentation, and the length threshold may be set according to an actual situation, for example, the length threshold may be set to 5 characters, and a specific numerical value is not limited. In this case, when the candidate binary group is expanded, the length of the expanded binary group does not exceed the length threshold.
In one case, the information entropy of the candidate binary includes left information entropy and right information entropy, that is, the information entropy of the left neighbor and the information entropy of the right neighbor in the above example. In this case, reference may be made to fig. 3:
s301: and judging whether the left information entropy and the right information entropy of the candidate binary group are both larger than a preset probability threshold, executing S302-S304 if the left information entropy is not larger than the preset probability threshold, executing S305-S307 if the right information entropy is not larger than the preset probability threshold, and executing S308 if the left information entropy and the right information entropy are both larger than the preset probability threshold.
S302: and expanding the candidate binary group to the left to obtain a left expanded binary group.
S303: judging whether the information entropy of the left expanded tuple is larger than a preset probability threshold value or not; if so, S304 is performed.
S304: and determining the left expanded multi-element group as a word segmentation.
S305: and expanding the candidate binary group to the right to obtain a right expanded multi-element group.
S306: judging whether the information entropy of the right expanded tuple is larger than a preset probability threshold value or not; if so, S307 is executed.
S307: and determining the right expanded multi-element group as a word segmentation.
S308: the candidate bigram is determined as a participle.
In one case, after the expanded tuple is obtained, it may be determined whether the expanded tuple reaches a preset length threshold, and if so, the subsequent step may not be performed, and if not, it may be determined whether the information entropy of the expanded tuple is greater than a preset probability threshold.
For example, referring to fig. 4, fig. 4 is based on fig. 3, and S309 is added after S302: judging whether the length of the left expanded tuple reaches a preset length threshold value or not; if the determination result in S309 is not reached, S303 is executed again. Similarly, S310 is added after S305: and judging whether the length of the right expanded tuple reaches a preset length threshold value or not, and executing S306 if the judgment result of S310 is that the length of the right expanded tuple does not reach the preset length threshold value.
It can be understood that the sensitive words are not infinitely long, and if the information entropy is not larger than the probability threshold after being expanded to a certain length, the sensitive words are not expanded, so that unnecessary calculation amount is reduced, and the recognition efficiency is improved.
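A sketch of one pass of the decision flow in Fig. 3/Fig. 4 follows. The helper entropy_fn (returning the left and right information entropy of a fragment) and the use of only the first occurrence of the bigram for expansion are simplifying assumptions.

```python
def classify_bigram(text, bigram, entropy_fn, prob_threshold, len_threshold):
    # One pass of the decision flow: keep the bigram, or try a single
    # left / right expansion when the corresponding entropy is too low.
    left_e, right_e = entropy_fn(text, bigram)
    if left_e > prob_threshold and right_e > prob_threshold:       # S308
        return [bigram]
    words = []
    i = text.find(bigram)                                          # first occurrence only
    if left_e <= prob_threshold and i > 0:                         # S302
        left_tuple = text[i - 1] + bigram
        if len(left_tuple) <= len_threshold:                       # S309: within length limit
            if entropy_fn(text, left_tuple)[0] > prob_threshold:   # S303
                words.append(left_tuple)                           # S304
    if right_e <= prob_threshold and i + len(bigram) < len(text):  # S305
        right_tuple = bigram + text[i + len(bigram)]
        if len(right_tuple) <= len_threshold:                      # S310: within length limit
            if entropy_fn(text, right_tuple)[1] > prob_threshold:  # S306
                words.append(right_tuple)                          # S307
    return words
```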
If the above one embodiment is adopted, the text to be recognized is subjected to any one or more of the above pretreatments to obtain the preprocessed text, the preprocessed text can be subjected to segmentation processing by adopting the embodiment, and the specific segmentation process is similar and is not repeated.
Or, other segmentation methods may also be adopted to segment the text to be recognized or the preprocessed text, and the specific segmentation method is not limited.
S103: and determining context information corresponding to each word in each participle.
The following description will be given taking the processing of a word segmentation as an example: as an embodiment, for each word in the participle, stroke information of the word may be obtained; performing characteristic numeralization processing on the stroke information to obtain a multi-element characteristic sequence of the character; and inputting the multi-element characteristic sequence of the character into a mapping model obtained by preset training to obtain context information corresponding to the character.
For example, a word vector construction method based on stroke features such as radicals and Chinese character components can be adopted, and the constructed word vector includes context information corresponding to a word. These stroke features facilitate the recognition of sensitive word variants with similar glyphs. Each word may be encoded based on stroke characteristics, for example, chinese character strokes may be divided into five categories as follows:
stroke name | Horizontal bar | Vertical | Skimming principle | A method of making a ball | Hook |
Shape of | A | I1 | Vertical and horizontal | I | |
|
1 | 2 | 3 | 4 | 5 |
Specifically, the text to be recognized may be divided into individual words, and the words may be divided into strokes to obtain stroke information of the words. The stroke information is subjected to characteristic numerical processing, which can be understood as converting strokes into IDs. And then combining the IDs according to the stroke sequence to obtain a multi-element characteristic sequence.
The multivariate (n-gram) feature sequence can be obtained with a sliding N-gram window. For example, after the stroke information of the Chinese character "pen" is numericalized into the sequence 1224443533, the first 3-element feature (3 strokes) is 122, the first 4-element feature (4 strokes) is 1224, and so on. The "multivariate feature sequence" used here is thus an N-gram feature sequence, i.e. an N-gram encoding of the character's strokes. For example, with N = 5, the feature sequence can include the following items:
122, 224, …, 1224, 12244, …
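The n-gram extraction over the stroke-ID string can be sketched as follows; the function name and the choice of n from 3 to 5 are illustrative.

```python
def stroke_ngrams(strokes: str, n_min: int = 3, n_max: int = 5):
    # `strokes` is the stroke-ID string of one character, e.g. "1224443533"
    # in the example above; the result is all n-grams of length n_min..n_max.
    return [strokes[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(strokes) - n + 1)]

print(stroke_ngrams("1224443533")[:4])   # ['122', '224', '244', '444']
```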
and inputting the obtained multivariate characteristic sequence into a mapping model obtained by preset training to obtain a word vector carrying context information.
For example, an embedded neural network with a preset structure can be trained to obtain the mapping model. Referring to fig. 5, the embedded neural network and the mapping model can include an input layer, a hidden layer and an output layer; the dimension of the input layer can be V, the dimension of the hidden layer can be N, and the dimension of the output layer can be C × V, where C represents half of the context length. For an input word x_k, the output of the output layer is the set of word vectors carrying context information {x_{k−C}, x_{k−C+1}, ..., x_{k−1}, x_{k+1}, ..., x_{k+C−1}, x_{k+C}}, the context length being 2C.
In FIG. 5, there is a weight matrix W_{V×N} between the input layer and the hidden layer. The i-th row of W_{V×N} represents the weight of the i-th word in a vocabulary table; the vocabulary table comprises some common words and can be obtained from statistics over historical chat text. The weight matrix W_{V×N} contains the weight information of all words in the vocabulary and can be understood as the parameters of the embedded neural network and the mapping model; the training process is the process of adjusting these parameters.
From the hidden layer to the output layer there is a weight matrix W′_{N×V} of dimension N × V. The hidden layer comprises N nodes, and the data obtained by weighting and summing the input layer is fed into the nodes of the hidden layer.
For example, the output positions may share weights; a multinomial distribution over the context of a selected word can then be generated by a softmax function, giving the probability of each word co-occurring with the selected word at each context position. It may be calculated by Formula 3:
y_{c,j} = p(w_{c,j} = w_{O,c} | w_I) = exp(u_{c,j}) / Σ_{j′=1}^{V} exp(u_{j′})   (Formula 3)
wherein y_{c,j} represents the predicted conditional probability distribution for the j-th word in the context; p(w_{c,j} = w_{O,c} | w_I) denotes, for a selected input word w_I, the conditional probability that the j-th word of the context is the output word w_{O,c}; "=" here means "corresponds to".
For example, if the selected word is "yellow" and the 2nd word in its context is "one", the conditional probability P("one" | "yellow") is abbreviated in Formula 3 as y_{"yellow",2}, i.e. j = 2 denotes the 2nd word in the context. The softmax function can also be expressed as an exp()/sum function. u_{c,j} represents the score obtained for the selected input word after the hidden-layer transformation, where the input-layer-to-hidden-layer transformation is h = x^T · W = v_{w_I} and the hidden-layer-to-output-layer transformation is u_{c,j} = h · u′_{w_j}.
The process of training to obtain the mapping model may include: acquiring a training text corpus (which need not include labeling information), inputting the corpus into the untrained embedded neural network, and iteratively learning the weights with a mathematical optimization algorithm, such as back propagation with stochastic gradient descent. The weight update iteration from the hidden layer to the output layer can refer to Formula 4, and the weight update iteration from the input layer to the hidden layer can refer to Formula 5.
w′^{(new)} = w′^{(old)} − η · (y_{c,j} − t_{c,j}) · h_i   (Formula 4)
wherein w′^{(new)} represents the updated weight from the hidden layer to the output layer, w′^{(old)} represents the weight before the update, η represents the learning rate used in training, y_{c,j} represents the predicted conditional probability distribution of the j-th word in the context, t_{c,j} represents the statistical frequency distribution of the j-th word in the context, and h_i represents the i-th node of the hidden layer. By computing the difference between y_{c,j} and t_{c,j}, the model can be continuously adjusted and optimized so that the prediction probability output by the neural network keeps approaching the true statistical probability.
In Formula 5, w^{(new)} represents the updated weight from the input layer to the hidden layer, w_{i,j} represents the learning weight of the j-th word given the i-th word, w^{(old)} represents the weight before the update, η represents the learning rate used in training, y_{c,j} and t_{c,j} are as above, h_i represents the i-th node of the hidden layer, and V represents the number of words in the corpus.
After the iteration converges, the word vector carrying the context information can be calculated according to equation 3. For example, the dimension of the word vector may be 100 dimensions, or may be other dimensions, and is not limited specifically.
In the above example, the output layers share the weight, which can reduce the amount of computation and improve the expression power of the generalized model.
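A minimal numpy sketch of the mapping model's forward pass as described above (input-to-hidden transformation, hidden-to-output transformation with shared output weights, softmax as in Formula 3); the matrix shapes and function name are assumptions for illustration.

```python
import numpy as np

def mapping_forward(x_onehot, W, W_prime):
    # x_onehot: (V,) one-hot input word; W: (V, N) input-to-hidden weights;
    # W_prime: (N, V) hidden-to-output weights (shared across context positions).
    h = x_onehot @ W                       # hidden representation v_{w_I}
    u = h @ W_prime                        # scores u_j over the vocabulary
    y = np.exp(u) / np.exp(u).sum()        # softmax, as in Formula 3
    return h, y
```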
S104: and generating a word vector sequence of the participle according to the semantic relation in the context information. And the word vector sequence comprises the word vector of the participle and the word vector of the participle in the context information.
As an embodiment, S104 may include: determining the contextual semantic dependencies of the participle, the dependencies including any one or more of: the preceding text, the following text, and fragments thereof; and generating the word vector sequence of the participle from the contextual semantic dependencies of the participle using a long-sequence coding algorithm.
In this embodiment, a long-sequence coding algorithm may be adopted, and a word vector sequence of the word segmentation may be generated based on the word vector carrying the context information obtained in S103 in combination with the forgetting factor.
Different word orders express different semantics. For example, assume that a text S contains T words, and that the word sequence forming the sentence is denoted {x_1, x_2, x_3, …, x_T}; the word vector carrying the context information of the t-th word (1 ≤ t ≤ T) is denoted e_t. By calculating z_t (1 ≤ t ≤ T) in turn with Formula 6, five types of forward/backward dependency relationships of the participle can be obtained, namely the preceding text, the following text, the fragment, the combination of the preceding text and the fragment, and the combination of the following text and the fragment.
wherein z_t represents the coding of the sequence from position 1 to position t, i.e. the contextual dependency of the sequence from position 1 to position t, and α (0 < α < 1) denotes the forgetting factor; α may be a fixed value between 0 and 1 and is used to indicate the influence of the preceding sequence on the current word, reflecting the word-order information of the word in the sequence.
The size of z_t is |V|, and this size is independent of the length T of the original text S, i.e. any text of indefinite length can be assigned an encoded representation of fixed length.
As can be seen from the above, based on the word vector carrying the context information obtained in S103, by using the above encoding process, a word vector sequence of each segmented word segmented from the text to be recognized can be generated.
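The exact form of Formula 6 is not reproduced above; the sketch below assumes a simple forgetting-factor recurrence z_t = α·z_{t−1} + e_t, which matches the description of α but may differ from the patent's actual formula.

```python
def encode_sequence(word_vectors, alpha=0.5):
    # word_vectors: list of word vectors e_1..e_T carrying context information.
    # Assumed recurrence: z_t = alpha * z_{t-1} + e_t (forgetting factor alpha).
    z = [0.0] * len(word_vectors[0])
    codes = []
    for e_t in word_vectors:
        z = [alpha * zi + ei for zi, ei in zip(z, e_t)]
        codes.append(list(z))
    return codes
```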
S105: and inputting the word vector sequence of the participle into a recognition model obtained by pre-training to obtain a recognition result of whether the participle is a sensitive word.
As an embodiment, the recognition model may be obtained by training as follows:
obtaining training samples, the training samples comprising: the word vector sequence is composed of word vectors of a plurality of continuous word segments, and classification results corresponding to the continuous word segments;
inputting the training sample into a classification model with a preset structure;
recording the order information of the word vector sequence by using a sequence memory unit in the classification model;
generating a sequence-based prediction signal by the classification model through a mean aggregation strategy;
and iteratively adjusting the parameters of the sequence memory unit based on the prediction signal and the classification result to obtain a trained recognition model.
In this embodiment, the recognition model is obtained by training a classification model of a preset structure. The classification model may be a training model based on logistic regression. The logical structure of the classification model can be as shown in fig. 6, and includes a plurality of sequence memory units, and the sequence memory units can record the order information of the word vector sequence. And (3) a training process, namely a process of iteratively adjusting the parameters of the sequence memory unit.
As illustrated with reference to FIG. 6, a training sample may include a plurality of vectors. Taking one vector [y, x_{i−c}, …, x_{i+c}] as an example, x_{i−c}, …, x_{i+c} is a word vector sequence consisting of the word vectors of a plurality of consecutive participles, where x_i is the word vector of the labeled participle, x_{i−c}, …, x_{i−1} are the word vectors of the c words before the labeled participle, and x_{i+1}, …, x_{i+c} are the word vectors of the c words after the labeled participle; y is the classification result corresponding to the labeled participle, e.g. y = 1 if the participle is labeled as a sensitive word and y = 0 if it is labeled as a non-sensitive word.
Referring to fig. 6, the training logic is as follows: the sequence memory unit records the order information of the word vector sequence, and a sequence-based prediction signal h is then generated through a mean aggregation strategy. The difference between the prediction signal h and the classification result y is computed; this difference is the loss of the classification model. The model training process is the process of minimizing this loss through feedback adjustment: the loss is fed back to the sequence memory unit, and the parameters of the sequence memory unit are iteratively adjusted based on the loss until training finishes, yielding the recognition model. The training process can also be understood as solving min(y, h), i.e. minimizing the difference between y and h, for example with the back-propagation algorithm. The recognition model can be understood as a set of weight parameters from which a prediction value h is calculated, i.e. a classification decision is made.
In one embodiment, after the training samples are input into the classification model, an input transformation may be applied. Let the input of the classification model at time t be the feature signal x_t; the transformation formula may be Formula 7:
c_in_t = tanh(W_xc · x_t + W_hc · h_{t−1} + b_c_in)   (Formula 7)
wherein W_xc and W_hc represent weight matrices, b_c_in represents a bias vector, tanh(·) represents the hyperbolic tangent transformation, c_in_t represents the transformed signal of the input feature signal x_t at time t, and h_{t−1} represents the prediction signal output for the input feature signal x_{t−1} at time t−1.
The prediction signal at time t is related to the corresponding input feature signal x_t and to the prediction signal h_{t−1} output for the input feature signal x_{t−1} at time t−1.
The classification model may comprise memory gating; the gating can model the contextual dependencies of the text and can reduce the interference of other factors with the gradients inside the model during training. Specifically, the memory gating may memorize sequence order information through Formula 8:
i_t = g(W_xi · x_t + W_hi · h_{t−1} + b_i), f_t = g(W_xf · x_t + W_hf · h_{t−1} + b_f), o_t = g(W_xo · x_t + W_ho · h_{t−1} + b_o)   (Formula 8)
wherein W_xi, W_hi, W_xf, W_hf, W_xo and W_ho represent weight matrices, b_i, b_f and b_o represent bias vectors, and g(·) represents an activation function, for which tanh(·) may specifically be used. i_t represents the input gate, used to measure how much present information is memorized; f_t represents the forget gate, used to measure how much past information is forgotten; o_t represents the output gate, used to measure how much information is retained.
In this embodiment, the output of the classification model may also be obtained through a transformation. Assume that the output of the model at time t is the prediction signal h_t; the transformation formula may be Formula 9:
wherein c_t represents the transformed (memory) signal at time t, i_t represents the input gate used to measure the memorizing of certain present information, f_t represents the forget gate used to measure the forgetting of certain past information, and o_t represents the output gate used to measure the retention of certain information.
As can be seen, Formula 9 fuses the result of the input transformation c_in_t with the results of the memory gating i_t, f_t and o_t.
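A sketch of one step of the sequence memory unit follows, assuming the usual gated-cell form suggested by Formulas 7-9; sigmoid gates are assumed here, whereas the text above only states that g(·) is an activation function (for which tanh may be used), and the dictionary of parameters is an illustrative packaging of the named weights and biases.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def memory_cell_step(x_t, h_prev, c_prev, p):
    # p holds the weight matrices W_xc, W_hc, W_xi, W_hi, W_xf, W_hf, W_xo, W_ho
    # and bias vectors b_c, b_i, b_f, b_o named in Formulas 7 and 8.
    c_in = np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c"])   # Formula 7
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["b_i"])    # input gate
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["b_f"])    # forget gate
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["b_o"])    # output gate
    c_t = f_t * c_prev + i_t * c_in      # fuse input transform and gating (Formula 9 style)
    h_t = o_t * np.tanh(c_t)             # prediction signal h_t
    return h_t, c_t
```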
The structure of the recognition model is similar to that of the classification model; reference may be made to fig. 7. Suppose the word vector sequence [q_{i−c}, …, q_{i+c}] of the participle to be recognized is obtained in S104, where q_{i−c}, …, q_{i+c} is a word vector sequence consisting of the word vectors of a plurality of consecutive participles, q_i is the word vector of the participle to be recognized, q_{i−c}, …, q_{i−1} are the word vectors of the c words before the participle to be recognized, and q_{i+1}, …, q_{i+c} are the word vectors of the c words after the participle to be recognized.
The word vector sequence is input into the trained recognition model and processed by the input transformation, memory gating, output transformation and other steps to obtain a signal h. For the processing of the word vector sequence by the recognition model, reference may also be made to the input transformation, memory gating and output transformation parts above. A category decision is then made on the signal h to obtain the prediction probability q_θ(h). The specific category decision formula can refer to Formula 10:
wherein θ represents a model parameter obtained by logistic regression training. In Formula 10, when the prediction probability q_θ(h) is greater than or equal to 0.5, z = 1, indicating that the recognition result is: the participle is a sensitive word. 0.5 is only a set threshold; the specific value is not limited.
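A sketch of the category decision of Formula 10 follows; the linear form θ·h inside the sigmoid is an assumption about how the logistic-regression parameter enters the formula.

```python
import numpy as np

def classify(h, theta, threshold=0.5):
    # Predicted probability q_theta(h); z = 1 means the participle is judged sensitive.
    q = 1.0 / (1.0 + np.exp(-np.dot(theta, h)))
    return 1 if q >= threshold else 0
```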
After S105, the identified sensitive word may be added to a stored sensitive thesaurus. For example, in the above embodiment, the generic sensitive words are stored in the form of a dictionary tree in advance, and after S105, the identified sensitive words may be added to the dictionary tree.
By applying the embodiment of the invention, the context information corresponding to each word in the participle is determined; generating a word vector sequence of the participle according to the semantic relation in the context information, wherein the word vector sequence comprises the word vector of the participle and the word vector of the participle in the context information; inputting the word vector sequence of the participle into a recognition model obtained by pre-training to obtain a recognition result of whether the participle is a sensitive word; for the sensitive word variants, the context semantic dependency relationship is unchanged even if the font or the pronunciation of the word is changed, so that the variants of the sensitive words can be identified by identifying the sensitive words based on the context semantic dependency relationship in the scheme, and the identification effect is improved.
To verify the effect of the embodiments of the present invention, the following experiments were performed: randomly selecting 10 days in one month, randomly screening 5000 chat texts each day, and performing sensitive word recognition on the screened texts. Two indexes of 'accuracy' and 'coverage' are defined in the experiment to measure the recognition effect. Wherein, the accuracy is defined as: identifying the exact number of samples divided by the predicted total number of samples, the coverage ratio being defined as: the exact number of samples identified is divided by the total number of labeled samples.
Experimental data show that the accuracy of the embodiment of the invention is 90.2%, and the coverage rate is 85.2%. The standard deviation of the accuracy is 0.02, the standard deviation of the coverage is 0.14, and the performance is stable. When the traditional character recognition algorithm is applied, due to the fact that word bank updating is delayed, the accuracy rate is 80.1% and the coverage rate is 75.2% at the initial time, and the accuracy rate and the coverage rate are remarkably reduced along with the time; the standard deviation of accuracy was 42.2 and the standard deviation of coverage was 55.1. Therefore, the embodiment of the invention is obviously superior to the traditional algorithm in the aspects of identification accuracy and stability.
Fig. 8 is a schematic flowchart of a second method for recognizing sensitive words according to an embodiment of the present invention, where the method includes:
s801: and acquiring a text to be recognized.
For example, the text to be recognized may be the text content published by the user in various scenes such as a web forum, a personal homepage, a game chat, and the like. The text to be recognized can be a Chinese text, an English text, a Japanese text and the like, and the specific language type is not limited.
S802: preprocessing the text to be recognized with any one or more of the following: character cleaning, turning full-angle characters into half-angle characters, turning traditional Chinese characters into simplified Chinese characters, turning pinyin into characters, merging split characters and restoring harmonic characters, to obtain a preprocessed text.
For example, if the text to be recognized is Chinese text, character cleaning may remove non-Chinese characters such as English letters, chat expressions and symbols like "# ¥ % @".
Full-angle to half-angle: full-angle characters are uniformly converted into half-angle characters, for example, the full-width form of "ABC123" is converted into the half-width "ABC123".
Traditional to simplified: traditional Chinese characters are converted into simplified Chinese characters, for example, the traditional-character form of "ten thousand years" is converted into its simplified form.
Converting pinyin into characters: pinyin letters are converted into simplified Chinese characters, for example, "yiwannian" is converted into "ten thousand years".
Merging split characters: if the text to be recognized is Chinese text, split characters are merged back into the normal Chinese character, for example, a character written as its separate components is merged back into the single character for "wheel".
Harmonic character restoration: harmonic (homophonic) characters are restored to the normal Chinese characters, for example, a homophonic variant of "processor" is restored to "processor".
If preprocessing is performed using more than one of the above modes, the order in which the modes are applied is not limited.
For example, some users insert characters such as non-Chinese characters, English letters, chat expressions or symbols like "# ¥ % @" into sensitive words, or write sensitive words in full-angle characters, traditional characters, pinyin, split characters, harmonic characters, and so on; the preprocessing above restores such variants so that they can be recognized, as illustrated in the sketch below.
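A minimal preprocessing sketch along these lines is shown below; it is only illustrative, the mapping tables are tiny placeholder assumptions (a real system would load full conversion dictionaries), and the pinyin-to-character step is omitted.

```python
import re
import unicodedata

# Placeholder mapping tables; real systems would load full dictionaries.
TRAD_TO_SIMP = {"萬": "万"}        # traditional -> simplified (assumed sample entry)
SPLIT_MERGE = {"车仑": "轮"}       # split components -> merged character (assumed sample entry)
HOMOPHONE_MAP = {}                 # harmonic character -> normal character

def full_to_half(text: str) -> str:
    """Turn full-angle characters into half-angle characters (NFKC covers full-width ASCII)."""
    return unicodedata.normalize("NFKC", text)

def apply_map(text: str, table: dict) -> str:
    for src, dst in table.items():
        text = text.replace(src, dst)
    return text

def clean_chars(text: str) -> str:
    """Character cleaning: keep only Chinese characters."""
    return re.sub(r"[^\u4e00-\u9fff]", "", text)

def preprocess(text: str) -> str:
    text = full_to_half(text)
    text = apply_map(text, TRAD_TO_SIMP)
    text = apply_map(text, SPLIT_MERGE)
    text = apply_map(text, HOMOPHONE_MAP)
    return clean_chars(text)
```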
S803: and iteratively intercepting character strings with preset lengths from the preprocessed text.
S804: and for each intercepted character string, matching the character string with a pre-established dictionary tree, judging whether a branch matched with the character string exists in the dictionary tree, and if so, executing S805.
S805: and recognizing the character string as a general sensitive word, and determining the preprocessed text which does not comprise the general sensitive word as the text to be processed.
For example, the dictionary tree may include a root node and a plurality of leaf nodes as shown in fig. 2, where the root node may be empty, or the root node does not include a character, and the leaf nodes include a character, and the characters from the root node to the end leaf node are connected into a character string, where the character string is a sensitive word. The branches of the dictionary tree from the root node to the end leaf node are understood as one branch corresponding to one character string or one branch corresponding to one sensitive word. The character strings, the dictionary tree structure, and the like in fig. 2 are merely examples, and the dictionary tree is not limited.
Character strings of a preset length are iteratively intercepted from the preprocessed text, where the preset length may be the maximum length of the sensitive words in the dictionary tree. For example, assuming the preset length is 5 characters and the preprocessed text is "we are good friends", the character strings obtained by iterative interception are the successive substrings of at most 5 characters of that text. This is only an example; the preprocessed text is usually a longer piece of text.
The matching process is similar for each intercepted character string; one character string is taken as an example below:
the matching process is a process of traversing the dictionary tree, wherein traversing starts from a root node, and sequentially matches each character in the character string with each leaf node in each branch of the dictionary tree according to the direction (the direction from the root node to a leaf node at the tail end) in the dictionary tree; if the character in the character string is successfully matched with each leaf node in the branch, the branch is matched with the character string.
Specifically, if the number of characters in the character string is greater than the number of leaf nodes in the branch, for example, the character string is "we are good" and one branch of the dictionary tree is "we", then the first two characters of the character string match with both leaf nodes of the branch successfully, which in this case indicates that the branch matches with the character string. If a string matches a branch in the dictionary tree, the string is a sensitive word.
In this embodiment, some general sensitive words are stored in advance in the form of a dictionary tree. The dictionary tree shares common prefixes among the stored words (with an empty root node), which reduces query-time overhead and improves matching efficiency.
In this embodiment, the character strings in the preprocessed text are matched with the dictionary tree, that is, the general sensitive words in the preprocessed text are recognized first. In one case, the general sensitive words identified in S805 may be removed from the preprocessed text, and the remaining text may be used as the text to be processed, and the subsequent steps may be continuously performed on the text to be processed, so that repeated identification of the identified sensitive words may be avoided, unnecessary computation may be reduced, and the computation efficiency may be improved.
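A minimal dictionary-tree sketch for storing general sensitive words and matching intercepted character strings might look as follows; the class and function names are illustrative assumptions, not the patent's implementation.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.is_end = False  # True if a sensitive word ends at this node

class SensitiveTrie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.add(word)

    def add(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def match(self, fragment: str) -> bool:
        """Return True if some branch (sensitive word) is a prefix of `fragment`."""
        node = self.root
        for ch in fragment:
            node = node.children.get(ch)
            if node is None:
                return False
            if node.is_end:
                return True
        return False

def find_general_sensitive(text: str, trie: SensitiveTrie, max_len: int):
    """Iteratively intercept strings of at most `max_len` characters and match them against the trie."""
    return [i for i in range(len(text)) if trie.match(text[i:i + max_len])]

# Example: SensitiveTrie(["we are"]).match("we are good") returns True.
```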
S806: and carrying out segmentation processing on the text to be processed to obtain a plurality of word segments.
As an embodiment, S806 may include: calculating mutual information between every two adjacent words in the text to be processed, wherein the mutual information represents the association degree between the adjacent words; aiming at each piece of mutual information which is larger than a preset correlation threshold value, forming a candidate binary group by two adjacent words corresponding to the mutual information; and calculating the information entropy of each candidate binary group, and segmenting the text to be processed according to the calculated information entropy to obtain a plurality of word segments.
For example, the mutual information between every two adjacent words in the text to be processed can be calculated by using the following formula:
PMI = p(x, y) / (p(x) · p(y)) = cnt(x, y) / (cnt(x) · cnt(y))  (formula 1)
Here, PMI denotes the mutual information between adjacent words; the larger the PMI, the stronger the degree of association between the adjacent words. p(x) and p(y) denote the probabilities that event x and event y occur, respectively; p(x, y) denotes the probability that event x and event y occur simultaneously; and cnt(.) denotes a frequency-counting function.
An association threshold may be preset, and if mutual information between two adjacent words is greater than the association threshold, which indicates that the association degree of the two adjacent words is strong, the two adjacent words are combined into a candidate binary group. The association threshold may be set according to actual conditions, for example, may be 1, and the specific value is not limited. It can be understood that the relevance degree between each character in the sensitive words is strong, a candidate binary group with strong relevance degree can be screened out from the text to be processed, and the binary group consists of two characters.
In this embodiment, the information entropy of each candidate binary group is calculated; the information entropy represents the degree of randomness of the left adjacent character set and the right adjacent character set of a text fragment. For a text fragment, the more varied the characters that appear adjacent to it on the left and right, the higher the probability that the fragment and its left/right neighbors belong to different words. For example, the information entropy of a candidate binary group may be calculated using the following equation:
Entropy(U) = -Σ_(i=1..n) p_i · log(p_i)  (formula 2)
Here, Entropy denotes the information entropy, U denotes the candidate binary group, i denotes the identifier of an adjacent word, p_i denotes the probability of occurrence of the adjacent word identified as i, and n denotes the number of adjacent words.
For example, take the tongue-twister sentence "eat grapes but do not spit out the grape skins; do not eat grapes yet spit out the grape skins" (吃葡萄不吐葡萄皮，不吃葡萄倒吐葡萄皮). The word "grape" (葡萄) appears four times; its left adjacent characters are {eat, spit, eat, spit} and its right adjacent characters are {not, skin, inversely, skin}. According to formula 2, the information entropy of the left adjacent characters of "grape" is -(1/2)·log(1/2) - (1/2)·log(1/2), and the information entropy of its right adjacent characters is -(1/2)·log(1/2) - (1/4)·log(1/4) - (1/4)·log(1/4).
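The mutual information of formula 1 (in the ratio form written above) and the boundary information entropy of formula 2 can be sketched as follows; the function names and the character-level granularity are assumptions for illustration.

```python
import math
from collections import Counter

def adjacent_pmi(text: str) -> dict:
    """Mutual information between every two adjacent characters (formula 1, ratio form)."""
    char_cnt = Counter(text)
    pair_cnt = Counter(text[i:i + 2] for i in range(len(text) - 1))
    return {pair: cnt / (char_cnt[pair[0]] * char_cnt[pair[1]])
            for pair, cnt in pair_cnt.items()}

def boundary_entropy(text: str, fragment: str, side: str = "left") -> float:
    """Information entropy of the left or right adjacent characters of `fragment` (formula 2)."""
    neighbors = []
    start = 0
    while (idx := text.find(fragment, start)) != -1:
        if side == "left" and idx > 0:
            neighbors.append(text[idx - 1])
        if side == "right" and idx + len(fragment) < len(text):
            neighbors.append(text[idx + len(fragment)])
        start = idx + 1
    if not neighbors:
        return 0.0
    probs = [c / len(neighbors) for c in Counter(neighbors).values()]
    return -sum(p * math.log(p) for p in probs)

# For the grape sentence above, boundary_entropy(sentence, "葡萄", "left")
# reproduces -(1/2)·log(1/2) - (1/2)·log(1/2).
```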
As an embodiment, for each candidate binary group, it may be judged whether the information entropy of the candidate binary group is greater than a preset probability threshold: if so, the candidate binary group is determined to be a participle; if the information entropy of the candidate binary group is not greater than the preset probability threshold, either no further step is performed, or the candidate binary group is expanded into a multi-tuple and it is judged whether the information entropy of the multi-tuple is greater than the preset probability threshold; if so, the multi-tuple is determined to be a participle.
The probability threshold value can be set according to actual conditions, and specific numerical values are not limited.
In one case, a length threshold may be set, where the length threshold represents a maximum length of a word segmentation, and the length threshold may be set according to an actual situation, for example, the length threshold may be set to 5 characters, and a specific numerical value is not limited. In this case, when the candidate binary group is expanded, the length of the expanded binary group does not exceed the length threshold.
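Using the boundary_entropy helper sketched above, the expansion step (left/right entropy check, left or right extension, and the length cap) can be illustrated as follows; the policy shown is a simplified reading of this embodiment, and the threshold values are placeholders.

```python
def expand_candidate(text: str, start: int, prob_threshold: float = 1.0, max_len: int = 5):
    """Expand the candidate doublet text[start:start+2] while its boundary entropy is too low."""
    lo, hi = start, start + 2
    while hi - lo <= max_len:
        fragment = text[lo:hi]
        left_e = boundary_entropy(text, fragment, "left")
        right_e = boundary_entropy(text, fragment, "right")
        if left_e > prob_threshold and right_e > prob_threshold:
            return fragment                       # both entropies large enough: accept as a participle
        if left_e <= prob_threshold and lo > 0:
            lo -= 1                               # left extension
        elif right_e <= prob_threshold and hi < len(text):
            hi += 1                               # right extension
        else:
            return None                           # cannot extend further
    return None                                   # length cap reached without acceptance
```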
S807: and determining context information corresponding to each word in each participle.
The following description will be given taking the processing of a word segmentation as an example: as an embodiment, for each word in the participle, stroke information of the word may be obtained; performing characteristic numeralization processing on the stroke information to obtain a multi-element characteristic sequence of the character; and inputting the multi-element characteristic sequence of the character into a mapping model obtained by preset training to obtain context information corresponding to the character.
For example, a word vector construction method based on stroke features such as radicals and Chinese character components can be adopted, and the constructed word vector carries the context information corresponding to a word. These stroke features facilitate the recognition of sensitive-word variants with similar glyphs. Each word may be encoded based on stroke features; for example, Chinese character strokes may be divided into the five categories below:
Stroke name | Horizontal | Vertical | Left-falling | Right-falling | Hook/turn
Shape | 一 | 丨 | 丿 | ㇏ | 乛
ID | 1 | 2 | 3 | 4 | 5
Specifically, the text to be recognized may be divided into individual words, and the words may be divided into strokes to obtain stroke information of the words. The stroke information is subjected to characteristic numerical processing, which can be understood as converting strokes into IDs. And then combining the IDs according to the stroke sequence to obtain a multi-element characteristic sequence.
The multi-element feature sequence can be obtained by sliding an N-gram window. For example, after feature numericalization of the stroke information of the Chinese character for "pen" yields the sequence 1224443533, the 3-element feature sequences (3 strokes) start with 122, the 4-element feature sequences (4 strokes) start with 1224, and so on. The "multi-element feature sequence" here may be an N-gram feature sequence, i.e., an N-gram encoding of the word. For example, with N = 5, the feature sequences can be as follows:
122, 224, ... (3-gram windows); 1224, ... (4-gram windows); 12244, ... (5-gram windows)
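A sketch of the feature numericalization and N-gram sliding window is given below; the stroke sequence of each character is assumed to come from an external stroke table, and the example sequence 1224443533 is the one given above.

```python
STROKE_ID = {"horizontal": 1, "vertical": 2, "left-falling": 3, "right-falling": 4, "hook": 5}

def strokes_to_ids(strokes) -> str:
    """Feature numericalization: map stroke names to the five category IDs."""
    return "".join(str(STROKE_ID[s]) for s in strokes)

def stroke_ngrams(id_sequence: str, n_min: int = 3, n_max: int = 5):
    """Slide N-gram windows over the stroke-ID sequence to build the multi-element feature sequence."""
    features = []
    for n in range(n_min, n_max + 1):
        features.extend(id_sequence[i:i + n] for i in range(len(id_sequence) - n + 1))
    return features

# Using the example sequence from the text above:
print(stroke_ngrams("1224443533")[:4])   # ['122', '224', '244', '444']
```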
and inputting the obtained multivariate characteristic sequence into a mapping model obtained by preset training to obtain a word vector carrying context information.
For example, an embedding neural network with a preset structure can be trained to obtain the mapping model. Referring to fig. 5, the embedding neural network and the mapping model may include an input layer, a hidden layer and an output layer; the dimension of the input layer may be V, the dimension of the hidden layer may be N, and the dimension of the output layer may be C × V, where C denotes half of the context length. The input of the input layer is x_k, and the output of the output layer is the set of word vectors carrying the context information of x_k: {x_(k-C), ..., x_(k-1), x_(k+1), ..., x_(k+C)}, where the context length is 2C.
In fig. 5, there is a weight matrix W_(V×N) between the input layer and the hidden layer. The i-th row of W_(V×N) represents the weight of the i-th word in a vocabulary table; the vocabulary contains common words and can be obtained from statistics over historical chat text. The weight matrix W_(V×N) contains the weight information of all words in the vocabulary and can be understood as the parameters of the embedding neural network and the mapping model; the training process is the process of adjusting these parameters.
From the hidden layer to the output layer there is a weight matrix W'_(N×V) of dimension N × V. The hidden layer comprises N nodes, and the data obtained by weighting and summing the input layer is fed into the hidden-layer nodes.
For example, the output layer may share weights: a multinomial distribution over the words in the context of the c-th word is generated by a softmax function, giving the probability of each context word given the c-th word, which may be calculated by formula 3:
Here, y_(c,j) denotes the predicted conditional probability distribution for the j-th word in the context; P(w_(c,j) = w_(O,c) | w_I) denotes, for a selected word w_I, the conditional probability that the j-th word in its context is w_O; ":=" means "equivalent to".
For example, if the word "yellow" is selected and the 2nd word in its context is "one", the conditional probability is P("one" | "yellow"), abbreviated in formula 3 as y_("yellow", 2), i.e., j = 2 indicates the 2nd word in the context. The softmax function may also be expressed as an exp()/sum function. u_(c,j) denotes the vector obtained after the input vector of the selected word c is transformed by the hidden layer, where the input-layer-to-hidden-layer transform is h = x^T · W = v_(w,I) and the hidden-layer-to-output-layer transform is u_(c,j) = h · u_(w,j).
The process of training to obtain the mapping model may include: acquiring a training text corpus, which need not include labeling information; inputting the text corpus into the untrained embedding neural network; and iteratively learning the weights w using a mathematical optimization algorithm, such as back propagation with stochastic gradient descent. The weight update iteration formula from the hidden layer to the output layer may refer to formula 4, and the weight update iteration formula from the input layer to the hidden layer may refer to formula 5.
Here, w'_(new) denotes the updated weight from the hidden layer to the output layer, w'_(old) denotes the weight from the hidden layer to the output layer before the update, η denotes the learning rate used in training, y_(c,j) denotes the predicted conditional probability distribution of the j-th word in the context, t_(c,j) denotes the statistical frequency probability distribution of the j-th word in the context, and h_i denotes the i-th node of the hidden layer. By comparing y_(c,j) with t_(c,j), the model can be continuously adjusted and optimized, so that the prediction probability output by the neural network keeps approaching the real statistical probability.
Here, w_(new) denotes the updated weight from the input layer to the hidden layer, w_(i,j) denotes the learning weight of the j-th word given the i-th word, w_(old) denotes the weight from the input layer to the hidden layer before the update, η denotes the learning rate used in training, y_(c,j) denotes the predicted conditional probability distribution of the j-th word in the context, t_(c,j) denotes the statistical frequency probability distribution of the j-th word in the context, h_i denotes the i-th node of the hidden layer, and V denotes the number of words in the corpus.
After the iteration converges, the word vector carrying the context information can be calculated according to equation 3. For example, the dimension of the word vector may be 100 dimensions, or may be other dimensions, and is not limited specifically.
In the above example, the output layer shares weights, which reduces the amount of computation and improves the generalization and expressive power of the model.
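As a rough, illustrative sketch (not the patent's exact network), the forward pass of a one-hidden-layer embedding model with output weights shared across the 2C context positions could look like this; the dimensions and initialization are placeholder assumptions, and the training loop with formulas 4 and 5 is omitted.

```python
import numpy as np

V, N, C = 5000, 100, 2                         # vocabulary size, hidden size, half context length (assumed)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(V, N))        # input-to-hidden weights, W_(V x N)
W_out = rng.normal(scale=0.01, size=(N, V))    # hidden-to-output weights, W'_(N x V), shared by all 2C positions

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def forward(word_index: int):
    """Predict a distribution over the vocabulary for each of the 2C context positions."""
    h = W[word_index]              # hidden layer: row lookup, equivalent to x^T W for a one-hot input x
    y = softmax(h @ W_out)         # shared output weights give one distribution
    return [y] * (2 * C)           # the same distribution is used for every context position

context_distributions = forward(42)   # distributions for the 2C context words of word index 42
```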
S808: and generating a word vector sequence of the participle according to the semantic relation in the context information. The word vector sequence comprises the word vector of the participle and the word vector of the participle in the context information.
As an embodiment, S808 may include: determining context semantic dependencies of the participle, the dependencies including any one or more of: above, below, fragments thereof; and generating a word vector sequence of the participle according to the context semantic dependency relationship of the participle by using a long sequence coding algorithm.
In this embodiment, a long-sequence coding algorithm may be adopted, and a word vector sequence of the word segmentation may be generated based on the word vector carrying the context information obtained in S807 by combining the forgetting factor.
Different word orders express different semantics. For example, assume that the text S contains T words and that the sentence sequence composed of these T words is denoted {x_1, x_2, x_3, ..., x_T}; the word vector carrying the context information of the t-th word (1 ≤ t ≤ T) is denoted e_t. By calculating z_t (1 ≤ t ≤ T) in turn through formula 6, five types of front-and-back dependency relationships of the participle can be obtained, namely the preceding text, the following text, the fragment, the combination of the preceding text and the fragment, and the combination of the following text and the fragment.
Here, z_t is the encoding of the sequence from position 1 to position t, i.e., the front-and-back dependency relationship of the sequence from position 1 to position t; α (0 < α < 1) denotes the forgetting factor, which may be a fixed value between 0 and 1, is used to indicate the influence of the preceding sequence on the current word, and reflects the word-order information of the word within the sequence.
The size of z_t is |V|, and it is independent of the length T of the original text S; that is, any text of indefinite length can be assigned a unique encoded representation of fixed size.
As can be seen from the above, based on the word vector carrying the context information obtained in S807, by using the above encoding process, a word vector sequence of each segmented word segmented from the text to be recognized can be generated.
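A sketch of the long-sequence encoding with a forgetting factor is given below. The recurrence z_t = α·z_(t-1) + e_t is an assumption about the exact form of formula 6, inferred only from the description of z_t and α above; the value of α and the vector dimension are placeholders.

```python
import numpy as np

def forgetting_encode(vectors, alpha: float = 0.7):
    """Encode a sequence of word vectors into codes z_1..z_T of fixed size.

    Assumes formula 6 takes the form z_t = alpha * z_(t-1) + e_t with 0 < alpha < 1,
    so earlier positions decay geometrically and word order is reflected in the code.
    """
    z = np.zeros_like(vectors[0], dtype=float)
    codes = []
    for e_t in vectors:
        z = alpha * z + e_t
        codes.append(z.copy())
    return codes

# Toy example with three 4-dimensional word vectors.
sequence = [np.array([1.0, 0, 0, 0]), np.array([0, 1.0, 0, 0]), np.array([0, 0, 1.0, 0])]
z_T = forgetting_encode(sequence)[-1]   # fixed-size representation of the whole sequence
```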
S809: and inputting the word vector sequence of the participle into a recognition model obtained by pre-training to obtain a recognition result of whether the participle is a sensitive word.
As an embodiment, the recognition model may be obtained by training as follows:
obtaining training samples, the training samples comprising: the word vector sequence is composed of word vectors of a plurality of continuous word segments, and classification results corresponding to the continuous word segments;
inputting the training sample into a classification model with a preset structure;
recording the order information of the word vector sequence by using a sequence memory unit in the classification model;
generating a sequence-based prediction signal by the classification model through a mean aggregation strategy;
and iteratively adjusting the parameters of the sequence memory unit based on the prediction signal and the classification result to obtain a trained recognition model.
In this embodiment, the recognition model is obtained by training a classification model of a preset structure. The classification model may be a training model based on logistic regression. The logical structure of the classification model can be as shown in fig. 6, and includes a plurality of sequence memory units, and the sequence memory units can record the order information of the word vector sequence. And (3) a training process, namely a process of iteratively adjusting the parameters of the sequence memory unit.
As illustrated with reference to fig. 6, a training sample may include a plurality of vectors. Take one vector [y, x_(i-c) ... x_(i+c)] as an example: x_(i-c) ... x_(i+c) is a word vector sequence consisting of the word vectors of a plurality of consecutive participles, where x_i denotes the word vector of the labeled participle, x_(i-c) ... x_(i-1) denote the word vectors corresponding to the c participles before the labeled participle, and x_(i+1) ... x_(i+c) denote the word vectors corresponding to the c participles after the labeled participle; y denotes the classification result corresponding to the labeled participle, for example, y is 1 if the participle is labeled as a sensitive word, and y is 0 if it is labeled as a non-sensitive word.
Referring to fig. 6, the training logic is as follows: the sequence memory units record the order information of the word vector sequence, and a sequence-based prediction signal h is then generated through a mean aggregation strategy; the difference between the prediction signal h and the classification result y is computed, and this difference is the loss of the classification model. Model training is the process of minimizing this loss through feedback adjustment: the loss is fed back to the sequence memory units, and the parameters of the sequence memory units are iteratively adjusted based on the loss until training finishes, yielding the recognition model. The training process may also be understood as minimizing the loss between y and h, for example by finding an optimal solution with the back-propagation algorithm. The recognition model can be understood as a set of weight parameters from which a prediction value h is computed, i.e., a classification decision is made. A simplified sketch of this feedback loop follows.
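The sketch below replaces the sequence memory units with a plain logistic model over the mean-aggregated word vectors so that only the "predict, compare with y, feed the loss back, adjust parameters" cycle is shown; it is not the actual classification model, and all names and hyperparameters are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(samples, dim, lr=0.1, epochs=50, seed=0):
    """samples: list of (y, vectors), where y is 0/1 and vectors are the word vectors of consecutive participles."""
    theta = np.random.default_rng(seed).normal(scale=0.01, size=dim)
    for _ in range(epochs):
        for y, vectors in samples:
            h = np.mean(vectors, axis=0)     # mean aggregation over the sequence
            pred = sigmoid(theta @ h)        # prediction signal
            grad = (pred - y) * h            # gradient of the log loss with respect to theta
            theta -= lr * grad               # feedback adjustment of the parameters
    return theta
```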
In one embodiment, after the training samples are input into the classification model, an input transformation may be applied to them. Let the input of the classification model at time t be a feature signal x_t; the transformation formula may be formula 7:
c_in_t = tanh(W_xc · x_t + W_hc · h_(t-1) + b_c_in)  (formula 7)
Here, W_xc and W_hc denote weight matrices, b_c_in denotes a bias vector, tanh(.) denotes the hyperbolic tangent transform, c_in_t denotes the transformed signal of the input feature signal x_t at time t, and h_(t-1) denotes the prediction signal output for the input feature signal x_(t-1) at time t-1.
The prediction signal at time t is related both to its corresponding input feature signal x_t and to the prediction signal h_(t-1) output for the input feature signal x_(t-1) at time t-1.
The classification model may include memory gating. The gating can model the contextual dependencies of the text and can reduce the interference of other factors on the gradients inside the model during training. Specifically, memory gating may memorize sequence order information through formula 8:
Here, W_xi, W_hi, W_xf, W_hf, W_xo and W_ho denote weight matrices, b_i, b_f and b_o denote bias vectors, and g(.) denotes an activation function; specifically, tanh(.) may be used as the activation function. i_t denotes the input gate, used to measure how much of the current information is memorized; f_t denotes the forget gate, used to measure how much past information is forgotten; o_t denotes the output gate, used to measure how much information is retained.
In the present embodiment, the output result of the classification model may undergo an output transformation. Assuming that the output of the model at time t is the prediction signal h_t, the transformation formula may be formula 9:
Here, c_t denotes the transformed signal at time t; i_t denotes the input gate, used to measure how much of the current information is memorized; f_t denotes the forget gate, used to measure how much past information is forgotten; o_t denotes the output gate, used to measure how much information is retained.
As can be seen, formula 9 fuses the result of the input transformation c_in_t with the results of the memory gating i_t, f_t and o_t.
The classification model is similar in structure to the recognition model; reference may be made to fig. 7. Suppose that a word vector sequence [q_(i-c) ... q_(i+c)] of the participle to be recognized is obtained in S808, where q_(i-c) ... q_(i+c) is a word vector sequence consisting of the word vectors of a plurality of consecutive participles, q_i denotes the word vector of the participle to be recognized, q_(i-c) ... q_(i-1) denote the word vectors corresponding to the c participles before the participle to be recognized, and q_(i+1) ... q_(i+c) denote the word vectors corresponding to the c participles after the participle to be recognized.
The word vector sequence is input into the trained recognition model, which applies input transformation, memory gating, output transformation and other processing to the word vector sequence to obtain a signal h. The processing of the word vector sequence by the recognition model may also refer to the descriptions of the input transformation, memory gating and output transformation above. A category decision is then made on the signal h to obtain a prediction probability q_θ(h). The specific category decision formula may refer to formula 10:
Here, θ denotes the model parameters obtained through logistic regression training. In formula 10, when the prediction probability q_θ(h) is greater than or equal to 0.5, z is 1, indicating that the recognition result is: the participle is a sensitive word. The value 0.5 is only one possible threshold, and the specific value is not limited.
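Read together, formulas 7 to 10 describe an LSTM-style cell followed by a logistic decision. The sketch below assumes the standard gated forms implied by the variable definitions above and uses the conventional sigmoid for the three gates (the text notes that tanh may also serve as the activation g(.)); parameter shapes, initialization and names are illustrative, not the patent's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(d_in, d_hidden, seed=0):
    rng = np.random.default_rng(seed)
    p = {}
    for name in ["xc", "hc", "xi", "hi", "xf", "hf", "xo", "ho"]:
        cols = d_in if name[0] == "x" else d_hidden
        p["W_" + name] = rng.normal(scale=0.1, size=(d_hidden, cols))
    for name in ["c_in", "i", "f", "o"]:
        p["b_" + name] = np.zeros(d_hidden)
    return p

def lstm_step(x_t, h_prev, c_prev, p):
    """Input transformation (formula 7), memory gating (formula 8) and output transformation (formula 9)."""
    c_in = np.tanh(p["W_xc"] @ x_t + p["W_hc"] @ h_prev + p["b_c_in"])  # formula 7
    i_t = sigmoid(p["W_xi"] @ x_t + p["W_hi"] @ h_prev + p["b_i"])      # input gate
    f_t = sigmoid(p["W_xf"] @ x_t + p["W_hf"] @ h_prev + p["b_f"])      # forget gate
    o_t = sigmoid(p["W_xo"] @ x_t + p["W_ho"] @ h_prev + p["b_o"])      # output gate
    c_t = f_t * c_prev + i_t * c_in                                     # fuse input transform and gates
    h_t = o_t * np.tanh(c_t)                                            # output transformation
    return h_t, c_t

def classify(word_vectors, params, theta):
    """Run the sequence, mean-aggregate the signals h, and apply the logistic decision (formula 10)."""
    d = params["b_i"].shape[0]
    h, c = np.zeros(d), np.zeros(d)
    signals = []
    for q in word_vectors:
        h, c = lstm_step(q, h, c, params)
        signals.append(h)
    prob = sigmoid(theta @ np.mean(signals, axis=0))
    return 1 if prob >= 0.5 else 0   # 1 means: the participle is a sensitive word
```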
After S809, the identified sensitive word may be added to the stored sensitive word repository. For example, in the above embodiment, the general sensitive words are stored in the form of a dictionary tree in advance, and after S809, the identified sensitive words may be added to the dictionary tree.
By applying the embodiment shown in fig. 8: first, for sensitive-word variants, the context semantic dependencies do not change even if the glyph or the pronunciation changes, so identifying sensitive words based on context semantic dependencies allows variants of sensitive words to be identified and improves the recognition effect. Second, preprocessing restores some sensitive-word variants to the sensitive words themselves, reducing the interference of variants on recognition. Third, the general sensitive words in the text are recognized first, and only the text that does not contain general sensitive words undergoes subsequent recognition, which avoids repeated recognition of already identified sensitive words, reduces unnecessary computation, and improves computational efficiency.
To verify the effect of the embodiments of the present invention, the following experiments were performed: randomly selecting 10 days in one month, randomly screening 5000 chat texts each day, and performing sensitive word recognition on the screened texts. Two indexes of 'accuracy' and 'coverage' are defined in the experiment to measure the recognition effect. Wherein, the accuracy is defined as: identifying the exact number of samples divided by the predicted total number of samples, the coverage ratio being defined as: the exact number of samples identified is divided by the total number of labeled samples.
Experimental data show that the accuracy of the embodiment of the invention is 90.2%, and the coverage rate is 85.2%. The standard deviation of the accuracy is 0.02, the standard deviation of the coverage is 0.14, and the performance is stable. When the traditional character recognition algorithm is applied, due to the fact that word bank updating is delayed, the accuracy rate is 80.1% and the coverage rate is 75.2% at the initial time, and the accuracy rate and the coverage rate are remarkably reduced along with the time; the standard deviation of accuracy was 42.2 and the standard deviation of coverage was 55.1. Therefore, the embodiment of the invention is obviously superior to the traditional algorithm in the aspects of identification accuracy and stability.
Corresponding to the foregoing method embodiment, an embodiment of the present invention further provides a sensitive word recognition apparatus, as shown in fig. 9, including:
an obtaining module 901, configured to obtain a text to be recognized;
a segmentation module 902, configured to perform segmentation processing on the text to be recognized to obtain multiple segmented words;
a determining module 903, configured to determine, for each participle, context information corresponding to each word in the participle;
a generating module 904, configured to generate a word vector sequence of the participle according to the semantic relationship in the context information, where the word vector sequence includes a word vector of the participle and a word vector of the participle in the context information;
the first recognition module 905 is configured to input the word vector sequence of the segmented word into a recognition model obtained through pre-training, so as to obtain a recognition result of whether the segmented word is a sensitive word.
As an embodiment, the apparatus further comprises:
a preprocessing module (not shown in the figure) for performing any one or more of the following preprocessing on the text to be recognized: cleaning characters, turning full angles to half angles, turning traditional Chinese characters into simplified Chinese characters, turning pinyin into characters, combining split characters and restoring harmonic characters, to obtain a preprocessed text;
the segmentation module 902 is specifically configured to: and carrying out segmentation processing on the preprocessed text to obtain a plurality of word segments.
As an embodiment, the apparatus further comprises:
a second recognition module (not shown in the figure) for iteratively intercepting character strings with preset lengths from the text to be recognized; aiming at each intercepted character string, matching the character string with a pre-established dictionary tree; if there is a branch in the trie that matches the character string, the character string is identified as a sensitive word.
As an embodiment, the segmentation module 902 includes: a calculation submodule, a composition submodule and a segmentation sub-module (not shown in the figure), wherein,
the calculation submodule is used for calculating mutual information between every two adjacent words in the text to be recognized, and the mutual information represents the association degree between the adjacent words;
the composition submodule is used for forming a candidate binary group by two adjacent words corresponding to each piece of mutual information which is larger than a preset association threshold value;
and the segmentation sub-module is used for calculating the information entropy of each candidate binary group, and segmenting the text to be recognized according to the calculated information entropy to obtain a plurality of word segments.
As an embodiment, the segmentation sub-module includes:
the judging unit is used for judging whether the information entropy of each candidate binary group is greater than a preset probability threshold value or not; if the number of the first determination unit is larger than the number of the second determination unit, triggering the first determination unit, and if the number of the first determination unit is not larger than the number of the second determination unit, triggering the second determination unit;
a first determining unit, configured to determine the candidate binary group as a word segmentation;
the second determining unit is used for expanding the candidate binary group to obtain a multi-element group and judging whether the information entropy of the multi-element group is greater than the preset probability threshold value or not; and if so, determining the multi-element group as a participle.
As an embodiment, the information entropy includes a left information entropy and a right information entropy; the judging unit is specifically configured to: judging whether the left information entropy and the right information entropy of the candidate binary group are both larger than a preset probability threshold value;
the second determining unit is specifically configured to: under the condition that the left information entropy is not larger than the preset probability threshold, expanding the candidate binary group to the left to obtain a left expanded multi-tuple, and judging whether the information entropy of the left expanded multi-tuple is larger than the preset probability threshold or not; determining the left-extended multi-element group as a word segmentation under the condition that the information entropy of the left-extended multi-element group is larger than the preset probability threshold; under the condition that the right information entropy is not larger than the preset probability threshold, expanding the candidate binary group to the right to obtain a right expanded multi-tuple, and judging whether the information entropy of the right expanded multi-tuple is larger than the preset probability threshold or not; and determining the right expanded multi-element group as a word segmentation under the condition that the information entropy of the right expanded multi-element group is greater than the preset probability threshold.
As an embodiment, the apparatus further comprises:
a determining module (not shown in the figure) configured to determine whether the length of the left extended tuple reaches a preset length threshold; if not, executing the step of judging whether the information entropy of the left expanded multi-element group is larger than the preset probability threshold value;
judging whether the length of the right expanded tuple reaches a preset length threshold value or not; and if not, executing the step of judging whether the information entropy of the right expanded multi-element group is larger than the preset probability threshold value.
As an embodiment, the determining module 903 is specifically configured to:
acquiring stroke information of each character in the participle aiming at each character;
performing characteristic numeralization processing on the stroke information to obtain a multi-element characteristic sequence of the character;
and inputting the multi-element characteristic sequence of the character into a mapping model obtained by preset training to obtain context information corresponding to the character.
As an implementation manner, the generating module 904 is specifically configured to:
determining context semantic dependencies of the participle, the dependencies including any one or more of: above, below, fragments thereof;
and generating a word vector sequence of the participle according to the context semantic dependency relationship of the participle by using a long sequence coding algorithm.
As an embodiment, the apparatus further comprises:
a model training module (not shown) for obtaining training samples, the training samples comprising: the word vector sequence is composed of word vectors of a plurality of continuous word segments, and classification results corresponding to the continuous word segments; inputting the training sample into a classification model with a preset structure; recording the order information of the word vector sequence by using a sequence memory unit in the classification model; generating a sequence-based prediction signal by the classification model through a mean aggregation strategy; and iteratively adjusting the parameters of the sequence memory unit based on the prediction signal and the classification result to obtain a trained recognition model.
By applying the embodiment of the invention, the context information corresponding to each word in the participle is determined; generating a word vector sequence of the participle according to the semantic relation in the context information, wherein the word vector sequence comprises the word vector of the participle and the word vector of the participle in the context information; inputting the word vector sequence of the participle into a recognition model obtained by pre-training to obtain a recognition result of whether the participle is a sensitive word; for the sensitive word variants, the context semantic dependency relationship is unchanged even if the font or the pronunciation of the word is changed, so that the variants of the sensitive words can be identified by identifying the sensitive words based on the context semantic dependency relationship in the scheme, and the identification effect is improved.
An embodiment of the present invention further provides an electronic device, as shown in fig. 10, including a processor 1001 and a memory 1002,
a memory 1002 for storing a computer program;
the processor 1001 is configured to implement any one of the above-described sensitive word recognition methods when executing the program stored in the memory 1002.
The Memory mentioned in the above electronic device may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component.
The embodiment of the invention also provides a computer-readable storage medium, wherein a computer program is stored in the computer-readable storage medium, and when being executed by a processor, the computer program realizes any one of the above sensitive word recognition methods.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, the device embodiment, and the computer-readable storage medium embodiment, since they are substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.
Claims (22)
1. A sensitive word recognition method, comprising:
acquiring a text to be identified;
segmenting the text to be recognized to obtain a plurality of word segments;
determining context information corresponding to each word in each participle aiming at each participle; generating a word vector sequence of the participle according to the semantic relation in the context information, wherein the word vector sequence comprises the word vector of the participle and the word vector of the participle in the context information;
and inputting the word vector sequence of the participle into a recognition model obtained by pre-training to obtain a recognition result of whether the participle is a sensitive word.
2. The method according to claim 1, wherein after the obtaining the text to be recognized, further comprising:
and performing any one or more of the following preprocessing on the text to be recognized: cleaning characters, turning full angles to half angles, turning traditional Chinese characters into simplified Chinese characters, turning pinyin into characters, combining split characters and restoring harmonic characters to obtain a preprocessed text;
the segmenting processing is carried out on the text to be recognized to obtain a plurality of word segments, and the method comprises the following steps:
and carrying out segmentation processing on the preprocessed text to obtain a plurality of word segments.
3. The method according to claim 1, wherein after the obtaining the text to be recognized, further comprising:
iteratively intercepting character strings with preset lengths from the text to be recognized;
aiming at each intercepted character string, matching the character string with a pre-established dictionary tree;
if there is a branch in the trie that matches the character string, the character string is identified as a sensitive word.
4. The method according to claim 1, wherein the segmenting the text to be recognized to obtain a plurality of word segments comprises:
calculating mutual information between every two adjacent words in the text to be recognized, wherein the mutual information represents the correlation degree between the adjacent words;
aiming at each piece of mutual information which is larger than a preset correlation threshold value, forming a candidate binary group by two adjacent words corresponding to the mutual information;
and calculating the information entropy of each candidate binary group, and segmenting the text to be recognized according to the calculated information entropy to obtain a plurality of word segments.
5. The method according to claim 4, wherein the segmenting the text to be recognized according to the calculated information entropy to obtain a plurality of word segments comprises:
judging whether the information entropy of each candidate binary group is greater than a preset probability threshold or not according to each candidate binary group;
if so, determining the candidate binary group as a participle;
if not, expanding the candidate binary group to obtain a multi-element group, and judging whether the information entropy of the multi-element group is greater than the preset probability threshold value or not; and if so, determining the multi-element group as a participle.
6. The method of claim 5, wherein the information entropy comprises a left information entropy and a right information entropy; the judging whether the information entropy of the candidate binary group is greater than a preset probability threshold includes:
judging whether the left information entropy and the right information entropy of the candidate binary group are both larger than a preset probability threshold value;
the expanding the candidate binary group to obtain a multi-tuple, and judging whether the information entropy of the multi-tuple is greater than the preset probability threshold value or not comprises the following steps:
under the condition that the left information entropy is not larger than the preset probability threshold, expanding the candidate binary group to the left to obtain a left expanded multi-tuple, and judging whether the information entropy of the left expanded multi-tuple is larger than the preset probability threshold or not;
under the condition that the right information entropy is not larger than the preset probability threshold, expanding the candidate binary group to the right to obtain a right expanded multi-tuple, and judging whether the information entropy of the right expanded multi-tuple is larger than the preset probability threshold or not;
the determining the multi-element group as a participle comprises:
determining the left-extended multi-element group as a word segmentation under the condition that the information entropy of the left-extended multi-element group is larger than the preset probability threshold;
and determining the right expanded multi-element group as a word segmentation under the condition that the information entropy of the right expanded multi-element group is greater than the preset probability threshold.
7. The method of claim 6, further comprising, after said obtaining the left extended tuple:
judging whether the length of the left expanded tuple reaches a preset length threshold value or not;
if not, executing the step of judging whether the information entropy of the left expanded multi-element group is larger than the preset probability threshold value;
after the obtaining of the right extended tuple, further comprising:
judging whether the length of the right expanded tuple reaches a preset length threshold value or not;
and if not, executing the step of judging whether the information entropy of the right expanded multi-element group is larger than the preset probability threshold value.
8. The method of claim 1, wherein determining context information corresponding to each word in the participle comprises:
acquiring stroke information of each character in the participle aiming at each character;
performing characteristic numeralization processing on the stroke information to obtain a multi-element characteristic sequence of the character;
and inputting the multi-element characteristic sequence of the character into a mapping model obtained by preset training to obtain context information corresponding to the character.
9. The method according to claim 1, wherein generating a word vector sequence of the segmented word according to the semantic relationship in the context information comprises:
determining context semantic dependencies of the participle, the dependencies including any one or more of: above, below, fragments thereof;
and generating a word vector sequence of the participle according to the context semantic dependency relationship of the participle by using a long sequence coding algorithm.
10. The method of claim 1, wherein the recognition model is trained by:
obtaining training samples, the training samples comprising: the word vector sequence is composed of word vectors of a plurality of continuous word segments, and classification results corresponding to the continuous word segments;
inputting the training sample into a classification model with a preset structure;
recording the order information of the word vector sequence by using a sequence memory unit in the classification model;
generating a sequence-based prediction signal by the classification model through a mean aggregation strategy;
and iteratively adjusting the parameters of the sequence memory unit based on the prediction signal and the classification result to obtain a trained recognition model.
11. A sensitive word recognition apparatus, comprising:
the acquisition module is used for acquiring a text to be recognized;
the segmentation module is used for carrying out segmentation processing on the text to be recognized to obtain a plurality of segmented words;
the determining module is used for determining context information corresponding to each word in each participle;
a generating module, configured to generate a word vector sequence of the participle according to a semantic relationship in the context information, where the word vector sequence includes a word vector of the participle and a word vector of the participle in the context information;
and the first recognition module is used for inputting the word vector sequence of the participle into a recognition model obtained by pre-training to obtain a recognition result of whether the participle is a sensitive word.
12. The apparatus of claim 11, further comprising:
the preprocessing module is used for performing any one or more of the following preprocessing on the text to be recognized: cleaning characters, turning full angles to half angles, turning traditional Chinese characters into simplified Chinese characters, turning pinyin into characters, combining split characters and restoring harmonic characters, to obtain a preprocessed text;
the segmentation module is specifically configured to: perform segmentation processing on the preprocessed text to obtain a plurality of word segments.
13. The apparatus of claim 11, further comprising:
the second identification module is used for iteratively intercepting character strings with preset lengths from the text to be identified; aiming at each intercepted character string, matching the character string with a pre-established dictionary tree; if there is a branch in the trie that matches the character string, the character string is identified as a sensitive word.
14. The apparatus of claim 11, wherein the segmentation module comprises:
the calculation submodule is used for calculating mutual information between every two adjacent words in the text to be recognized, and the mutual information represents the association degree between the adjacent words;
the composition submodule is used for forming a candidate binary group by two adjacent words corresponding to each piece of mutual information which is larger than a preset association threshold value;
and the segmentation sub-module is used for calculating the information entropy of each candidate binary group, and segmenting the text to be recognized according to the calculated information entropy to obtain a plurality of word segments.
15. The apparatus of claim 14, wherein the segmentation sub-module comprises:
the judging unit is used for judging whether the information entropy of each candidate binary group is greater than a preset probability threshold value or not; if the number of the first determination unit is larger than the number of the second determination unit, triggering the first determination unit, and if the number of the first determination unit is not larger than the number of the second determination unit, triggering the second determination unit;
a first determining unit, configured to determine the candidate binary group as a word segmentation;
the second determining unit is used for expanding the candidate binary group to obtain a multi-element group and judging whether the information entropy of the multi-element group is greater than the preset probability threshold value or not; and if so, determining the multi-element group as a participle.
16. The apparatus of claim 15, wherein the information entropy comprises a left information entropy and a right information entropy; the judging unit is specifically configured to: judging whether the left information entropy and the right information entropy of the candidate binary group are both larger than a preset probability threshold value;
the second determining unit is specifically configured to: under the condition that the left information entropy is not larger than the preset probability threshold, expanding the candidate binary group to the left to obtain a left expanded multi-tuple, and judging whether the information entropy of the left expanded multi-tuple is larger than the preset probability threshold or not; determining the left-extended multi-element group as a word segmentation under the condition that the information entropy of the left-extended multi-element group is larger than the preset probability threshold; under the condition that the right information entropy is not larger than the preset probability threshold, expanding the candidate binary group to the right to obtain a right expanded multi-tuple, and judging whether the information entropy of the right expanded multi-tuple is larger than the preset probability threshold or not; and determining the right expanded multi-element group as a word segmentation under the condition that the information entropy of the right expanded multi-element group is greater than the preset probability threshold.
17. The apparatus of claim 16, further comprising:
the judging module is used for judging whether the length of the left-expanded multi-element group reaches a preset length threshold; and if not, executing the step of judging whether the information entropy of the left-expanded multi-element group is greater than the preset probability threshold;
judging whether the length of the right-expanded multi-element group reaches the preset length threshold; and if not, executing the step of judging whether the information entropy of the right-expanded multi-element group is greater than the preset probability threshold.
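The left/right information entropy and the expansion logic of claims 15 to 17 could be sketched roughly as below; the toy corpus, the probability threshold, the length cap, and the simplification of expanding only from the first occurrence of the fragment are all assumptions for illustration.

```python
# Illustrative sketch: left/right information entropy and fragment expansion.
import math
from collections import Counter


def side_entropy(corpus, fragment, side):
    """Entropy of the characters adjacent to `fragment` on the given side."""
    neighbours = []
    start = corpus.find(fragment)
    while start != -1:
        if side == "left" and start > 0:
            neighbours.append(corpus[start - 1])
        if side == "right" and start + len(fragment) < len(corpus):
            neighbours.append(corpus[start + len(fragment)])
        start = corpus.find(fragment, start + 1)
    counts = Counter(neighbours)
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values()) if total else 0.0


def expand_candidate(corpus, bigram, threshold=0.5, max_len=6):
    """While a side entropy does not exceed the threshold, grow the fragment on
    that side (up to a preset length cap), mirroring claims 15-17 in spirit."""
    fragment = bigram
    for side in ("left", "right"):
        while len(fragment) < max_len and side_entropy(corpus, fragment, side) <= threshold:
            idx = corpus.find(fragment)
            if side == "left" and idx > 0:
                fragment = corpus[idx - 1] + fragment
            elif side == "right" and 0 <= idx and idx + len(fragment) < len(corpus):
                fragment = fragment + corpus[idx + len(fragment)]
            else:
                break
    return fragment


corpus = "xxgrapeyy zzgrapeww grape"          # toy corpus
print(side_entropy(corpus, "grape", "left"))  # entropy over {'x', 'z', ' '}
print(expand_candidate(corpus, "gr"))         # -> 'grape'
```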
18. The apparatus of claim 11, wherein the determining module is specifically configured to:
for each character in the participle, acquire stroke information of the character;
perform feature numericalization processing on the stroke information to obtain a multi-element feature sequence of the character;
and input the multi-element feature sequence of the character into a pre-trained mapping model to obtain context information corresponding to the character.
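A hedged sketch of this stroke numericalization is given below; the five-way stroke coding (horizontal=1, vertical=2, left-falling=3, right-falling/dot=4, turning=5), the tiny stroke dictionary, and the stubbed mapping model are illustrative assumptions rather than the patent's actual feature scheme.

```python
# Illustrative sketch: numericalize per-character stroke information into an
# n-gram ("multi-element") feature sequence, then hand it to a mapping model.
STROKE_CODE = {"heng": 1, "shu": 2, "pie": 3, "na": 4, "dian": 4, "zhe": 5}

# Hypothetical stroke dictionary: character -> ordered stroke names.
STROKE_DICT = {
    "大": ["heng", "pie", "na"],
    "人": ["pie", "na"],
}


def stroke_features(char, n=2):
    """Numericalize the strokes of one character and emit its n-gram feature sequence."""
    codes = [STROKE_CODE[s] for s in STROKE_DICT[char]]
    return [tuple(codes[i:i + n]) for i in range(len(codes) - n + 1)]


def context_information(char, mapping_model):
    """Feed the feature sequence to a pre-trained mapping model (left abstract here)."""
    return mapping_model(stroke_features(char))


print(stroke_features("大"))   # [(1, 3), (3, 4)]
```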
19. The apparatus of claim 11, wherein the generating module is specifically configured to:
determine the context semantic dependencies of the participle, the dependencies including any one or more of: the preceding text, the following text, and fragments thereof;
and generate a word vector sequence of the participle according to its context semantic dependencies by using a long-sequence coding algorithm.
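Assembling those context dependencies might, under assumed tokenization and window size, look like the minimal sketch below; the long-sequence coding algorithm itself is not reproduced here, and the window of two tokens on each side is an arbitrary illustrative choice.

```python
# Illustrative sketch: collect the preceding text, the participle itself, and the
# following text into one ordered sequence to be fed to a sequence encoder.
def dependency_sequence(tokens, index, window=2):
    """Return preceding context + participle + following context (any may be empty)."""
    above = tokens[max(0, index - window):index]       # preceding text
    below = tokens[index + 1:index + 1 + window]       # following text
    return above + [tokens[index]] + below


tokens = ["the", "text", "to", "be", "recognized"]
print(dependency_sequence(tokens, index=2))   # ['the', 'text', 'to', 'be', 'recognized']
```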
20. The apparatus of claim 11, further comprising:
a model training module, configured to: obtain training samples, each training sample comprising a word vector sequence composed of the word vectors of a plurality of consecutive participles, and the classification result corresponding to those consecutive participles; input the training samples into a classification model with a preset structure; record the order information of the word vector sequence by using a sequence memory unit in the classification model; generate a sequence-based prediction signal by the classification model through a mean aggregation strategy; and iteratively adjust the parameters of the sequence memory unit based on the prediction signal and the classification result to obtain a trained recognition model.
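As one possible reading of this training procedure, the PyTorch sketch below uses an LSTM as the sequence memory unit and mean pooling over its outputs as the mean-aggregation strategy; the dimensions, optimizer, loss, and toy data are assumptions, and the patent's actual classification model structure may differ.

```python
# Illustrative sketch: sequence-memory classifier trained with mean aggregation.
import torch
import torch.nn as nn


class SensitiveSequenceClassifier(nn.Module):
    def __init__(self, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.memory = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # records order information
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, word_vectors):              # (batch, seq_len, embed_dim)
        outputs, _ = self.memory(word_vectors)    # (batch, seq_len, hidden_dim)
        pooled = outputs.mean(dim=1)              # mean aggregation over the sequence
        return self.head(pooled).squeeze(-1)      # sequence-based prediction signal


model = SensitiveSequenceClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

# Toy batch: word-vector sequences of 5 consecutive participles plus binary labels.
x = torch.randn(8, 5, 64)
y = torch.randint(0, 2, (8,)).float()
for _ in range(3):                                # iterative parameter adjustment
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
```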
21. An electronic device comprising a processor and a memory;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1-10 when executing a program stored in the memory.
22. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of the claims 1-10.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811603465.0A CN111368535B (en) | 2018-12-26 | 2018-12-26 | Sensitive word recognition method, device and equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111368535A true CN111368535A (en) | 2020-07-03 |
CN111368535B CN111368535B (en) | 2024-01-16 |
Family
ID=71206104
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811603465.0A Active CN111368535B (en) | 2018-12-26 | 2018-12-26 | Sensitive word recognition method, device and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111368535B (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104850819A (en) * | 2014-02-18 | 2015-08-19 | 联想(北京)有限公司 | Information processing method and electronic device |
WO2017185674A1 (en) * | 2016-04-29 | 2017-11-02 | 乐视控股(北京)有限公司 | Method and apparatus for discovering new word |
CN108734159A (en) * | 2017-04-18 | 2018-11-02 | 苏宁云商集团股份有限公司 | The detection method and system of sensitive information in a kind of image |
CN107168952A (en) * | 2017-05-15 | 2017-09-15 | 北京百度网讯科技有限公司 | Information generating method and device based on artificial intelligence |
CN108021558A (en) * | 2017-12-27 | 2018-05-11 | 北京金山安全软件有限公司 | Keyword recognition method and device, electronic equipment and storage medium |
CN108984530A (en) * | 2018-07-23 | 2018-12-11 | 北京信息科技大学 | A kind of detection method and detection system of network sensitive content |
Cited By (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112036167A (en) * | 2020-08-25 | 2020-12-04 | 腾讯科技(深圳)有限公司 | Data processing method, device, server and storage medium |
CN112036167B (en) * | 2020-08-25 | 2023-11-28 | 腾讯科技(深圳)有限公司 | Data processing method, device, server and storage medium |
CN111814192A (en) * | 2020-08-28 | 2020-10-23 | 支付宝(杭州)信息技术有限公司 | Training sample generation method and device and sensitive information detection method and device |
CN112084306A (en) * | 2020-09-10 | 2020-12-15 | 北京天融信网络安全技术有限公司 | Sensitive word mining method and device, storage medium and electronic equipment |
CN112084306B (en) * | 2020-09-10 | 2023-08-29 | 北京天融信网络安全技术有限公司 | Keyword mining method and device, storage medium and electronic equipment |
CN112287684A (en) * | 2020-10-30 | 2021-01-29 | 中国科学院自动化研究所 | Short text auditing method and device integrating variant word recognition |
CN112287684B (en) * | 2020-10-30 | 2024-06-11 | 中国科学院自动化研究所 | Short text auditing method and device for fusion variant word recognition |
CN112364216A (en) * | 2020-11-23 | 2021-02-12 | 上海竞信网络科技有限公司 | Edge node content auditing and filtering system and method |
CN112612894A (en) * | 2020-12-29 | 2021-04-06 | 平安科技(深圳)有限公司 | Method and device for training intention recognition model, computer equipment and storage medium |
WO2022166613A1 (en) * | 2021-02-02 | 2022-08-11 | 北京有竹居网络技术有限公司 | Method and apparatus for recognizing role in text, and readable medium and electronic device |
CN112906380A (en) * | 2021-02-02 | 2021-06-04 | 北京有竹居网络技术有限公司 | Method and device for identifying role in text, readable medium and electronic equipment |
CN112801425A (en) * | 2021-03-31 | 2021-05-14 | 腾讯科技(深圳)有限公司 | Method and device for determining information click rate, computer equipment and storage medium |
CN113076748A (en) * | 2021-04-16 | 2021-07-06 | 平安国际智慧城市科技股份有限公司 | Method, device and equipment for processing bullet screen sensitive words and storage medium |
CN113076748B (en) * | 2021-04-16 | 2024-01-19 | 平安国际智慧城市科技股份有限公司 | Bullet screen sensitive word processing method, device, equipment and storage medium |
CN113033217B (en) * | 2021-04-19 | 2023-09-15 | 广州欢网科技有限责任公司 | Automatic shielding translation method and device for subtitle sensitive information |
CN113033217A (en) * | 2021-04-19 | 2021-06-25 | 广州欢网科技有限责任公司 | Method and device for automatically shielding and translating sensitive subtitle information |
CN113128241B (en) * | 2021-05-17 | 2024-11-01 | 口碑(上海)信息技术有限公司 | Text recognition method, device and equipment |
CN113128241A (en) * | 2021-05-17 | 2021-07-16 | 口碑(上海)信息技术有限公司 | Text recognition method, device and equipment |
CN113449510A (en) * | 2021-06-28 | 2021-09-28 | 平安科技(深圳)有限公司 | Text recognition method, device, equipment and storage medium |
CN113407658A (en) * | 2021-07-06 | 2021-09-17 | 北京容联七陌科技有限公司 | Method and system for filtering and replacing text content sensitive words in online customer service scene |
CN113536782A (en) * | 2021-07-09 | 2021-10-22 | 平安国际智慧城市科技股份有限公司 | Sensitive word recognition method and device, electronic equipment and storage medium |
CN113536782B (en) * | 2021-07-09 | 2023-12-26 | 平安国际智慧城市科技股份有限公司 | Sensitive word recognition method and device, electronic equipment and storage medium |
CN113723096A (en) * | 2021-07-23 | 2021-11-30 | 智慧芽信息科技(苏州)有限公司 | Text recognition method and device, computer-readable storage medium and electronic equipment |
CN114117149A (en) * | 2021-11-25 | 2022-03-01 | 深圳前海微众银行股份有限公司 | Sensitive word filtering method and device and storage medium |
CN115238044A (en) * | 2022-09-21 | 2022-10-25 | 广州市千钧网络科技有限公司 | Sensitive word detection method, device and equipment and readable storage medium |
CN117009533B (en) * | 2023-09-27 | 2023-12-26 | 戎行技术有限公司 | Dark language identification method based on classification extraction and word vector model |
CN117009533A (en) * | 2023-09-27 | 2023-11-07 | 戎行技术有限公司 | Dark language identification method based on classification extraction and word vector model |
CN117077678B (en) * | 2023-10-13 | 2023-12-29 | 河北神玥软件科技股份有限公司 | Sensitive word recognition method, device, equipment and medium |
CN117077678A (en) * | 2023-10-13 | 2023-11-17 | 河北神玥软件科技股份有限公司 | Sensitive word recognition method, device, equipment and medium |
CN117435692A (en) * | 2023-11-02 | 2024-01-23 | 北京云上曲率科技有限公司 | Variant-based antagonism sensitive text recognition method and system |
CN117493540A (en) * | 2023-12-28 | 2024-02-02 | 荣耀终端有限公司 | Text matching method, terminal device and computer readable storage medium |
CN117592473A (en) * | 2024-01-18 | 2024-02-23 | 武汉杏仁桉科技有限公司 | Harmonic splitting processing method and device for multiple Chinese phrases |
CN117592473B (en) * | 2024-01-18 | 2024-04-09 | 武汉杏仁桉科技有限公司 | Harmonic splitting processing method and device for multiple Chinese phrases |
CN118468858A (en) * | 2024-07-11 | 2024-08-09 | 厦门众联世纪股份有限公司 | Advertisement corpus contraband word processing method and system |
CN118536506A (en) * | 2024-07-26 | 2024-08-23 | 厦门众联世纪股份有限公司 | Multimode forbidden word detection method and system |
CN118536506B (en) * | 2024-07-26 | 2024-10-29 | 厦门众联世纪股份有限公司 | Multimode forbidden word detection method and system |
Also Published As
Publication number | Publication date |
---|---|
CN111368535B (en) | 2024-01-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111368535A (en) | Sensitive word recognition method, device and equipment | |
CN113239181B (en) | Scientific and technological literature citation recommendation method based on deep learning | |
CN111914067B (en) | Chinese text matching method and system | |
RU2628431C1 (en) | Selection of text classifier parameter based on semantic characteristics | |
CN109376222B (en) | Question-answer matching degree calculation method, question-answer automatic matching method and device | |
CN107832306A (en) | A kind of similar entities method for digging based on Doc2vec | |
CN112487190B (en) | Method for extracting relationships between entities from text based on self-supervision and clustering technology | |
Burdisso et al. | τ-SS3: A text classifier with dynamic n-grams for early risk detection over text streams | |
CN109033084B (en) | Semantic hierarchical tree construction method and device | |
CN111462751A (en) | Method, apparatus, computer device and storage medium for decoding voice data | |
US20220114340A1 (en) | System and method for an automatic search and comparison tool | |
CN111160000B (en) | Composition automatic scoring method, device terminal equipment and storage medium | |
CN112667780A (en) | Comment information generation method and device, electronic equipment and storage medium | |
Noaman et al. | Enhancing recurrent neural network-based language models by word tokenization | |
CN111241271B (en) | Text emotion classification method and device and electronic equipment | |
CN106202065A (en) | A kind of across language topic detecting method and system | |
Singla et al. | An Optimized Deep Learning Model for Emotion Classification in Tweets. | |
Nambiar et al. | Attention based abstractive summarization of malayalam document | |
CN111523311B (en) | Search intention recognition method and device | |
CN117725203A (en) | Document abstract generation method, device, computer equipment and storage medium | |
CN112835798A (en) | Cluster learning method, test step clustering method and related device | |
Yang et al. | Extractive text summarization model based on advantage actor-critic and graph matrix methodology | |
CN107729509B (en) | Discourse similarity determination method based on recessive high-dimensional distributed feature representation | |
Zhu et al. | A named entity recognition model based on ensemble learning | |
Zhang | Comparing the Effect of Smoothing and N-gram Order: Finding the Best Way to Combine the Smoothing and Order of N-gram |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: 519000 Room 102, 202, 302 and 402, No. 325, Qiandao Ring Road, Tangjiawan Town, high tech Zone, Zhuhai City, Guangdong Province, Room 102 and 202, No. 327 and Room 302, No. 329
Applicant after: Zhuhai Jinshan Digital Network Technology Co.,Ltd.
Address before: 519080 Room 102, No. 325, Qiandao Ring Road, Tangjiawan Town, high tech Zone, Zhuhai City, Guangdong Province
Applicant before: ZHUHAI KINGSOFT ONLINE GAME TECHNOLOGY Co.,Ltd.
CB02 | Change of applicant information | ||
GR01 | Patent grant | ||
GR01 | Patent grant |