CN114416925B - Sensitive word recognition method, device, equipment, storage medium and program product - Google Patents


Info

Publication number: CN114416925B (grant of application CN202210064884.1A)
Authority: CN (China)
Prior art keywords: word, vector, position information, component, determining
Other languages: Chinese (zh)
Other versions: CN114416925A
Inventors: 翟永刚, 刘海东
Assignees: Peking University; Guangzhou Baiguoyuan Network Technology Co Ltd
Events: application filed by Peking University and Guangzhou Baiguoyuan Network Technology Co Ltd; publication of CN114416925A; application granted; publication of CN114416925B
Legal status: Active (granted)

Classifications

    • G06F16/3331: Query processing (G06F16/30 Information retrieval of unstructured textual data; G06F16/33 Querying)
    • G06F40/126: Character encoding (G06F40/10 Text processing; G06F40/12 Use of codes for handling textual entities)
    • G06F40/242: Dictionaries (G06F40/20 Natural language analysis; G06F40/237 Lexical tools)
    • G06F40/295: Named entity recognition (G06F40/279 Recognition of textual entities; G06F40/289 Phrasal analysis)
    • G06N3/045: Combinations of networks (G06N3/02 Neural networks; G06N3/04 Architecture)
    • G06N3/08: Learning methods (G06N3/02 Neural networks)
    • G06N7/01: Probabilistic graphical models, e.g. probabilistic networks


Abstract

The application discloses a sensitive word recognition method, apparatus, device, storage medium, and program product. The method comprises: determining a word set of a text to be recognized based on a pre-generated domain dictionary, wherein each word in the word set carries head position information and tail position information; splitting each word in the word set into word-forming components to obtain the components corresponding to each word; acquiring the word vector corresponding to each word and the component vector corresponding to each word-forming component; generating an input vector for each word based on its word vector and component vectors; inputting the head position information, tail position information, and input vector of each word into a pre-generated sequence labeling model, which determines a labeling result for each word based on these inputs; and identifying sensitive words according to the labeling results, thereby improving the accuracy of sensitive word recognition.

Description

Sensitive word recognition method, device, equipment, storage medium and program product
Technical Field
The present application relates to the field of data processing technology, and in particular, to a method for recognizing a sensitive word, a device for recognizing a sensitive word, an electronic apparatus, a computer-readable storage medium, and a computer program product.
Background
In the field of content auditing for automatic speech recognition, the traditional practice is to manually audit large volumes of sensitive text, extract the sensitive words to build a sensitive word library, and then judge whether a sentence contains sensitive words by looking them up in that library. This approach is inefficient, and because sensitive words change over time with their environment, a manually built sensitive word library often lacks accuracy, coverage, and timeliness, greatly weakening the detection of sensitive information.
In the related art, methods have been proposed for extracting sensitive words based on statistical features, most of them resting on information-entropy calculations: the frequency, cohesion, and degrees of freedom of candidate words in the text must be computed to decide whether they are sensitive words. Such methods suffer from complex computation in practice and low accuracy.
In recent years, with the continued development of deep learning, methods have appeared that extract sensitive words from text with deep learning models. Because such models make better use of context information, they better resolve the ambiguity of words within sentences and can identify more low-frequency new words. However, most deep-learning-based methods require word segmentation of the text, and new words often cause segmentation errors; since some sensitive words in the corpus are themselves new words, segmentation errors further hurt the model's recognition accuracy. In addition, the deep learning models published so far for sensitive word recognition are essentially built on RNNs (Recurrent Neural Networks), which require iterative computation and cannot be parallelized, so the models run inefficiently and struggle to capture the global semantic information of the text.
Disclosure of Invention
The application provides a sensitive word recognition method, a device, equipment, a storage medium and a program product, which are used for solving the problem of low sensitive word recognition accuracy in the prior art.
In a first aspect, an embodiment of the present application provides a method for identifying a sensitive word, where the method includes:
determining a word set of a text to be recognized based on a pre-generated domain dictionary database, wherein each word in the word set comprises head position information and tail position information;
splitting word forming components of each word in the word set to obtain word forming components corresponding to each word;
acquiring word vectors corresponding to the words, and acquiring word-forming component vectors corresponding to the word-forming components of the words; generating an input vector of each word based on the word vector and the word-forming component vectors of the word;
inputting the head position information, tail position information, and input vector of each word in the word set into a pre-generated sequence labeling model; determining, by the sequence labeling model, a multidimensional vector representation of each input vector using a relative position encoding algorithm based on the head position information, the tail position information, and the input vector; and determining the labeling result of each word based on the multidimensional vector representation, wherein the sequence labeling model comprises a Transformer encoder and a conditional random field decoder;
And identifying the sensitive words according to the labeling results of the words.
In a second aspect, an embodiment of the present application further provides a sensitive word recognition apparatus, where the apparatus includes:
The word set determining module is used for determining a word set of the text to be recognized based on a pre-generated domain dictionary library, wherein each word in the word set comprises head position information and tail position information;
the word-forming component acquisition module is used for splitting each word in the word set into word-forming components to obtain the word-forming components corresponding to each word;
the input vector determining module is used for obtaining word vectors corresponding to the words and word forming component vectors corresponding to the word forming components of the words; generating an input vector of each word based on the word vector of the word and the word forming component vector;
the sequence labeling module is used for inputting the head position information, tail position information, and input vector of each word in the word set into a pre-generated sequence labeling model, determining, by the sequence labeling model, a multidimensional vector representation of each input vector using a relative position encoding algorithm based on the head position information, the tail position information, and the input vector, and determining the labeling result of each word based on the multidimensional vector representation, wherein the sequence labeling model comprises a Transformer encoder and a conditional random field decoder;
And the sensitive word recognition module is used for recognizing sensitive words according to labeling results of the words.
In a third aspect, an embodiment of the present application further provides an electronic device, including:
one or more processors;
storage means for storing one or more programs,
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of the first aspect described above.
In a fourth aspect, embodiments of the present application also provide a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the method of the first aspect described above.
In a fifth aspect, embodiments of the present application also provide a computer program product comprising computer executable instructions for implementing the method of the first aspect described above when executed.
The technical scheme provided by the application has the following beneficial effects:
In this embodiment, after the word set of the text to be recognized is obtained, each word in the word set is split into word-forming components. Based on each word and its components, a word vector and word-forming component vectors are acquired, and the input vector of the word is generated from them, so that the input vector carries both the information of the word and the information of its word-forming components. This enriches the input-layer representation of the sequence labeling model and effectively alleviates the problem of word ambiguity.
In addition, the sequence labeling model comprises a Transformer encoder and a Conditional Random Field (CRF) decoder. By combining the Transformer with the CRF, and unlike LSTM and CNN networks, the Transformer encoder can process tokens in parallel, which improves the running efficiency of the model as well as the labeling effect, and thus the recognition accuracy of sensitive words.
In addition, this embodiment obtains the multidimensional vector representation of each input vector through a relative position encoding algorithm, which effectively introduces the lexical information of the sentence and thereby improves the model's ability to recognize sensitive word boundaries.
Drawings
FIG. 1 is a flowchart of an embodiment of a method for recognizing a sensitive word according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a sequence annotation model framework according to a first embodiment of the present application;
FIG. 3 is a flowchart of an embodiment of a method for recognizing a sensitive word according to a second embodiment of the present application;
FIG. 4 is a schematic diagram of a Transformer encoder framework according to a second embodiment of the present application;
FIG. 5 is a schematic diagram of a vector representation of d-dimension for an input vector according to a second embodiment of the present application;
FIG. 6 is a block diagram illustrating an embodiment of a sensitive word recognition apparatus according to a third embodiment of the present application;
FIG. 7 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application.
Detailed Description
The application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present application are shown in the drawings.
Example One
Fig. 1 is a flowchart of an embodiment of a method for recognizing a sensitive word, which is provided in an embodiment of the present application, and the embodiment may be applied to a device for recognizing a sensitive word, where the device may be located in a server or a client, and the embodiment is not limited to this.
As shown in fig. 1, the present embodiment may include the following steps:
Step 110, determining a word set of a text to be recognized based on a pre-generated domain dictionary, wherein each word in the word set carries head position information and tail position information.
In practice, the text to be recognized may have different sources and roles depending on the requirements and application scenario. For example, it may be text produced by recognizing speech with an ASR (Automatic Speech Recognition) system and then denoised (e.g., stripped of garbled characters produced when decoding fails). As another example, it may be sample data for a sensitive word recognition model, used to train and iterate that model. As yet another example, it may be text that has been posted, or is about to be posted, on social software, social networking sites, or the network at large. This embodiment is not limited in this respect.
In addition, the text to be recognized may be a sentence, a paragraph, an article, a title, or the like, which is not limited in this embodiment.
A word set is a token set: each word in it represents a token, and a token means either a character (a single Chinese character) or a word (a multi-character term).
The domain dictionary library can comprise a plurality of words and word vectors corresponding to the words commonly used in the current domain. In one implementation, words commonly used in the current domain, which are usually business unlabeled data, can be collected, and word2vec algorithm is then adopted to pre-generate word vectors of the words to form a domain dictionary library. In one implementation, a dictionary matching method may be used to match text to be identified in a domain dictionary database, thereby extracting a word set.
After the word set is extracted, the head position information and tail position information of each word in the set can be determined. The head position information represents the position of the word's first character in the text to be recognized, and the tail position information represents the position of its last character. For example, in "希尔顿离开休斯顿机场" ("Hilton leaving Houston airport"), "希尔顿" ("Hilton") is a word whose head position information is 1 and whose tail position information is 3. If a token is a single character, its head position information and tail position information are the same.
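As an illustration of this step, the following sketch performs forward maximum matching against a small domain dictionary and returns 1-based head and tail positions; the dictionary contents and function names are illustrative, not taken from the patent.

```python
# A minimal sketch of extracting the token set with head/tail positions via
# forward maximum matching against a domain dictionary. Positions are 1-based
# and inclusive, following the example above.

def match_tokens(text, domain_dict, max_word_len=8):
    """Return (token, head, tail) triples for the text."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest dictionary word starting at position i first;
        # fall back to a single character if nothing matches.
        for j in range(min(len(text), i + max_word_len), i, -1):
            if text[i:j] in domain_dict or j == i + 1:
                tokens.append((text[i:j], i + 1, j))  # head = i+1, tail = j
                i = j
                break
    return tokens

domain_dict = {"希尔顿", "休斯顿", "机场"}  # illustrative domain dictionary
print(match_tokens("希尔顿离开休斯顿机场", domain_dict))
# [('希尔顿', 1, 3), ('离', 4, 4), ('开', 5, 5), ('休斯顿', 6, 8), ('机场', 9, 10)]
```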
Step 120, splitting each word in the word set into word-forming components to obtain the word-forming components corresponding to each word.
A word-forming component is one of the parts that make up a Chinese character. Most Chinese characters are pictophonetic, composed of a semantic component and a phonetic component, so word-forming components mainly comprise semantic and phonetic radicals. For example, the character "语" comprises the components "讠" and "吾"; the character "盆" consists of two components, "分" and "皿"; the character "问" consists of two components, the door frame "门" and the mouth "口". Decomposing the structural components of each word in the word set yields the word-forming component set of each word.
If a token is a multi-character word, the word-forming components of each of its characters are split out, and the components of all its characters are combined into the word-forming component set of the word.
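A minimal sketch of component splitting, assuming a character-to-component lookup table. Real systems would derive such a table from a full character decomposition resource; the three entries below are illustrative only.

```python
# A sketch of word-forming component splitting with a small component table.
COMPONENTS = {
    "问": ["门", "口"],   # door frame + mouth, as in the example above
    "盆": ["分", "皿"],
    "语": ["讠", "吾"],
}

def split_components(token):
    """Collect the word-forming components of every character in a token."""
    parts = []
    for ch in token:
        # Characters missing from the table fall back to themselves.
        parts.extend(COMPONENTS.get(ch, [ch]))
    return parts

print(split_components("问"))    # ['门', '口']
print(split_components("语盆"))  # ['讠', '吾', '分', '皿']
```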
Step 130, obtaining word vectors corresponding to each word, and obtaining word component vectors corresponding to word components of each word; and generating an input vector for each word based on the word vector for the word and the word building component vector.
This step is used to convert each word into a high-dimensional input vector, which aims to capture the relationship of the words of the text in high-dimensional space.
In this embodiment, the input vector of each word may include a word vector and a word forming component vector. In one implementation, the word vector of each word may be directly spliced with the word building component vector to form the input vector for that word.
In one implementation, the word vector may be obtained by querying a pre-generated dictionary, and the word component vector may be obtained by vectorizing each word component using a set vectorization rule.
According to the embodiment, the word forming component vectors of the characters are fused at the input layer of the model, so that the input layer information of the model is richer, and the problem of word ambiguity can be effectively relieved.
Step 140, inputting the head position information, the tail position information, and the input vector of each word in the word set into a pre-generated sequence labeling model; determining, by the sequence labeling model, a multidimensional vector representation of each input vector using a relative position encoding algorithm based on the head position information, the tail position information, and the input vector; and determining the labeling result of each word based on the multidimensional vector representation.
This embodiment can use a sequence labeling model to label each word. As shown in FIG. 2, the sequence labeling model may include at least an input layer, a Transformer encoder, and a conditional random field (Conditional Random Fields, CRF) decoder. The input vectors enter the input layer, which passes them to the Transformer encoder; the Transformer encoder encodes the input vectors (i.e., performs feature extraction) and outputs hidden state information to the CRF decoder. The CRF decoder performs decoding, such as entity recognition, on the received hidden state information and outputs the labeling result of each word.
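As a rough structural illustration, the following sketch assembles an input layer, a Transformer encoder, and a CRF decoder from stock PyTorch components plus the third-party pytorch-crf package (an assumed dependency). The patent's encoder additionally uses rotary relative position encoding and attention-residual connections, which are omitted here for brevity.

```python
# A minimal sketch of the sequence labeling pipeline: input layer ->
# Transformer encoder -> CRF decoder.
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf (assumed dependency)

class SequenceLabeler(nn.Module):
    def __init__(self, input_dim, d_model, num_tags, num_layers=2, nhead=8):
        super().__init__()
        self.input_proj = nn.Linear(input_dim, d_model)   # input layer
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.emission = nn.Linear(d_model, num_tags)      # hidden states -> tag scores
        self.crf = CRF(num_tags, batch_first=True)        # CRF decoder

    def forward(self, x, tags=None, mask=None):
        h = self.encoder(self.input_proj(x))
        emissions = self.emission(h)
        if tags is not None:            # training: negative log-likelihood loss
            return -self.crf(emissions, tags, mask=mask)
        return self.crf.decode(emissions, mask=mask)  # inference: label sequences
```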
In the sequence labeling model, a relative position coding algorithm is adopted to determine multidimensional vector representation of each input vector according to head position information, tail position information and the input vector, wherein the dimension of the multidimensional vector representation is related to the dimension of the input vector, and vocabulary information in sentences can be effectively introduced through the relative position coding algorithm, so that the sensitive word boundary recognition capability of the model can be improved. The labeling result of each word can be determined according to the multidimensional vector representation of each input vector.
In one embodiment, the sequence annotation model may be trained as follows:
(1) And labeling the acquired sample data set by adopting a bidirectional maximum matching algorithm based on a pre-generated domain dictionary library.
Labeling the sample data set, collected offline or online, through the domain dictionary effectively eases the problem that sequence labeling requires large numbers of labeled samples. In implementation, the domain dictionary can be refined through iterative optimization to improve its coverage and accuracy.
One form of iterative optimization is: with the existing labeled samples, the model predicts over a large number of unlabeled samples; the sensitive words newly predicted by the model are added to the domain dictionary; and this loop is repeated several times to improve the model's recognition of new sensitive words.
(2) And dividing the marked sample data set into a training data set, a verification data set and a test data set according to a set proportion.
The labeled sample data set is divided into positive and negative samples in proportion for training, validating, and testing the model. The proportion may be set according to actual requirements, which this embodiment does not limit; for example, the training, validation, and test data sets may be divided 8:1:1, with a positive-to-negative sample ratio of approximately 1.2:1 in each set.
(3) Performing data enhancement on the training data set using a preset data enhancement strategy based on named entity recognition.
In implementation, by analyzing the characteristics of harmful data and business data, a data enhancement strategy based on the NER (Named Entity Recognition) task can be adopted, further strengthening the model's ability to recognize deformed sensitive words with limited data samples.
Illustratively, the data enhancement policy includes one or a combination of the following:
randomly deleting a character within an entity; replacing an entity with a homophone; replacing an entity with a synonym; replacing an entity with another entity bearing the same label; and so on. A sketch of these strategies follows.
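The following sketch illustrates the four strategies. The homophone and synonym tables are placeholders; a production system would draw entity spans and lexicons from real annotations.

```python
# An illustrative sketch of the entity-level augmentation strategies above.
import random

HOMOPHONES = {"词": ["辞"]}      # placeholder homophone table
SYNONYMS   = {"机场": ["空港"]}  # placeholder synonym table

def delete_random_char(entity):
    """Randomly delete one character inside an entity."""
    if len(entity) < 2:
        return entity
    i = random.randrange(len(entity))
    return entity[:i] + entity[i + 1:]

def replace_homophone(entity):
    """Replace each character with a homophone where one is known."""
    return "".join(random.choice(HOMOPHONES.get(c, [c])) for c in entity)

def replace_synonym(entity):
    return random.choice(SYNONYMS.get(entity, [entity]))

def replace_same_label(entity, entities_by_label, label):
    """Swap an entity for another entity carrying the same label."""
    return random.choice(entities_by_label.get(label, [entity]))
```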
(4) And training the sequence annotation model by adopting the training data set after data enhancement.
(5) And respectively adopting the verification data set and the test data set to verify and test the trained sequence annotation model.
The sequence annotation model can be trained on line and deployed on line after training is completed, and the prediction function of the sensitive words on the real service data set is completed through reasoning of the model.
Step 150, identifying sensitive words according to the labeling results of the words.
The labeling result may be a label indicating whether the current word is a sensitive word. Specifically, after the labeling results output by the decoder are obtained, the results representing sensitive words can be picked out by parsing, and the words corresponding to those results are taken as sensitive words. For example, if the labeling result of a word is "B-SZ", the word is a politics-related sensitive word. Conversely, if the labeling result of a word is "O", the word is not a sensitive word.
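A small sketch of this parsing step, assuming the common BIO convention in which "B-"/"I-" tags open and continue an entity and "O" marks non-sensitive tokens (the "I-" continuation tags are an assumption, not spelled out in the patent):

```python
# Turn per-token labeling results into (sensitive word, category) pairs.
def extract_sensitive_words(tokens, labels):
    words, current, category = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):          # a new sensitive word begins
            if current:
                words.append(("".join(current), category))
            current, category = [token], label[2:]
        elif label.startswith("I-") and current:
            current.append(token)           # continuation of the same word
        else:                               # "O": flush any open word
            if current:
                words.append(("".join(current), category))
            current, category = [], None
    if current:
        words.append(("".join(current), category))
    return words

print(extract_sensitive_words(["某", "敏感", "词", "出现"],
                              ["O", "B-SZ", "I-SZ", "O"]))
# [('敏感词', 'SZ')]
```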
In this embodiment, after the word set of the text to be recognized is obtained, each word in the word set is split into word-forming components. Based on each word and its components, a word vector and word-forming component vectors are acquired, and the input vector of the word is generated from them, so that the input vector carries both the information of the word and the information of its word-forming components. This enriches the input-layer representation of the sequence labeling model and effectively alleviates the problem of word ambiguity.
In addition, the sequence labeling model comprises a Transformer encoder and a Conditional Random Field (CRF) decoder. By combining the Transformer with the CRF, and unlike LSTM and CNN networks, the Transformer encoder can process tokens in parallel, which improves the running efficiency of the model as well as the labeling effect, and thus the recognition accuracy of sensitive words.
In addition, this embodiment obtains the multidimensional vector representation of each input vector through a relative position encoding algorithm, which effectively introduces the lexical information of the sentence and thereby improves the model's ability to recognize sensitive word boundaries.
Example Two
Fig. 3 is a flowchart of an embodiment of a method for identifying a sensitive word according to a second embodiment of the present application, where the embodiment is described in more detail based on the first embodiment, as shown in fig. 3, and the embodiment may include the following steps:
Step 310, in a pre-generated domain dictionary library, performing word matching on the text to be identified by adopting a matching algorithm to obtain a word set of the text to be identified, and obtaining head position information and tail position information of each word in the word set in the text to be identified.
Word matching is carried out on the text to be recognized by adopting a matching algorithm in the field dictionary library, so that a word set of the text to be recognized can be obtained. Illustratively, the matching algorithm may include, for example, but is not limited to: a forward maximum matching algorithm, a reverse maximum matching algorithm, a bi-directional maximum matching algorithm, or the like. As to which matching algorithm is adopted, it can be determined by a developer according to actual service requirements, which is not limited in this embodiment.
In this method, the word set is determined through an external domain dictionary, which enriches the number of words and effectively fuses the externally matched lexical information into the model, increasing the model's capacity to identify word boundaries without dynamically modifying the model structure.
Step 320, obtaining word vectors corresponding to the matched words from the domain dictionary database.
After the words are matched from the domain dictionary database, word vectors corresponding to the matched words can be obtained from the domain dictionary database.
Step 330, searching for the N-gram word vectors corresponding to the words in a pre-generated domain N-gram dictionary, wherein the domain N-gram dictionary comprises a number of words commonly used in the current domain and the N-gram word vectors corresponding to the words.
In this embodiment, the word vectors may also include N-gram word vectors, such as unigram, bigram, and trigram vectors.
In one implementation, after collecting the words commonly used in the current domain, a unigram, bigram, or trigram model may be used to generate the N-gram word vectors of each word, and these vectors together with their words form the domain N-gram dictionary. The specific gram model is determined by the character count of a word: for example, a word of 2 characters gets a bigram vector from the bigram model, and a word of 3 characters gets a trigram vector from the trigram model.
By looking up each word of the word set in the domain N-gram dictionary, the N-gram word vector of the word is obtained.
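A minimal sketch of this lookup, with an illustrative dictionary layout keyed by (N, word); neither the layout nor the fallback behavior is prescribed by the patent.

```python
# Look up a token's N-gram vector, choosing N from its character count.
import numpy as np

def ngram_vector(token, ngram_dict, dim=50):
    """Return the token's N-gram vector, with N = len(token)."""
    n = len(token)                       # e.g. "机场" -> n = 2 -> bigram model
    vec = ngram_dict.get((n, token))
    if vec is None:                      # unseen token: fall back to zeros
        vec = np.zeros(dim)
    return vec

ngram_dict = {(2, "机场"): np.random.randn(50)}  # illustrative entry
print(ngram_vector("机场", ngram_dict).shape)    # (50,)
```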
Step 340, splitting each word in the word set into word-forming components to obtain the word-forming components corresponding to each word.
Owing to the structured nature of Chinese characters, the components that make up a character carry additional useful information. By decomposing each Chinese character in each token according to its structural components, every character can be decomposed into one or more word-forming components.
Step 350, inputting each word-forming component corresponding to each word into a pre-generated word-forming component network model, and acquiring the word-forming component vector of each component output by the model.
In this embodiment, the word-forming component vectors are extracted by feeding each component of each token into a pre-generated word-forming component network model. The model may be, for example, a CNN (Convolutional Neural Network), with the embedded vector features of each component obtained through a max pooling operation and a fully connected layer.
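A sketch of such a component network in PyTorch, with illustrative hyperparameters: component embeddings pass through a 1-D convolution, max pooling over the components, and a fully connected layer, producing one component vector per token. The exact architecture is an assumption; the patent only names CNN, max pooling, and a fully connected layer.

```python
import torch
import torch.nn as nn

class ComponentCNN(nn.Module):
    def __init__(self, num_components, emb_dim=32, out_dim=64, kernel=3):
        super().__init__()
        self.embed = nn.Embedding(num_components, emb_dim)
        self.conv = nn.Conv1d(emb_dim, out_dim, kernel, padding=kernel // 2)
        self.fc = nn.Linear(out_dim, out_dim)

    def forward(self, component_ids):          # (tokens, components_per_token)
        e = self.embed(component_ids)          # (tokens, len, emb_dim)
        c = self.conv(e.transpose(1, 2))       # (tokens, out_dim, len)
        pooled = c.max(dim=-1).values          # max pooling over components
        return self.fc(pooled)                 # (tokens, out_dim) component vector

net = ComponentCNN(num_components=500)
ids = torch.randint(0, 500, (4, 6))            # 4 tokens, up to 6 components each
print(net(ids).shape)                          # torch.Size([4, 64])
```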
Step 360, concatenating the word vector of each word, its N-gram word vector, and the word-forming component vector of each of its components into the input vector of the word.
In this step, splicing the word vector, the N-gram word vector, and the word-forming component vectors of each word together yields the input vector of the word.
For example, assuming the N-gram vector is a bigram vector, the input vector of each token can be represented as:

$$\mathbf{x}_i = [\mathbf{c}_i \,;\, \mathbf{b}_i \,;\, \mathbf{r}_i]$$

where $i$ denotes the $i$-th token, $\mathbf{c}_i$ is its word vector, $\mathbf{b}_i$ its bigram vector, and $\mathbf{r}_i$ its word-forming component vector.
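As a concrete illustration of the concatenation, the following sketch builds $\mathbf{x}_i$ from the three vectors; the dimensions are arbitrary placeholders.

```python
# Assemble the input vector by concatenating word, bigram, and component
# vectors, per the formula above.
import numpy as np

def build_input_vector(word_vec, bigram_vec, component_vec):
    """x_i = [c_i ; b_i ; r_i]: simple concatenation along one axis."""
    return np.concatenate([word_vec, bigram_vec, component_vec])

x_i = build_input_vector(np.ones(100), np.ones(50), np.ones(64))
print(x_i.shape)  # (214,): d = 100 + 50 + 64
```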
Step 370, inputting the head position information, tail position information, and input vector of each word in the word set into a pre-generated sequence labeling model; determining, by the sequence labeling model, a multidimensional vector representation of each input vector using a relative position encoding algorithm based on the head position information, tail position information, and input vector; and determining the labeling result of each word based on the multidimensional vector representation, wherein the sequence labeling model comprises a Transformer encoder and a conditional random field decoder, the Transformer encoder comprises at least two Transformer encoding components, and the Transformer encoding components are connected through attention residuals.
In this embodiment, the sequence labeling model may include an encoder and a decoder, where the encoder adopts the Transformer encoder structure. As shown in FIG. 4, the Transformer encoder may include at least two Transformer encoding components (two are illustrated in FIG. 4), and each encoding component may include at least a self-attention layer, a feed-forward neural network layer, a summed normalization residual layer (Add & Norm), and a residual attention layer (Residual_Attention). A summed normalization residual layer surrounds each sub-layer of each Transformer encoding component, and the self-attention layers of two adjacent Transformer encoding components are connected through the attention residual of the residual attention layer.
In one embodiment, the step of determining a multidimensional vector representation of each input vector using a relative position coding algorithm in step 370 based on the head position information, the tail position information, and the input vector may include the steps of:
Step 370-1, determining the relative head position vector representation of the input vector corresponding to the word according to the position of the word in the text to be recognized, the head position information, and the dimension of the input vector corresponding to the word.
Step 370-2, determining the relative tail position vector representation of the input vector corresponding to the word according to the position of the word in the text to be recognized, the tail position information, and the dimension of the input vector corresponding to the word.
In one implementation, the relative head position vector representation or the relative tail position vector representation can be determined with rotary relative position encoding, which simplifies the computation of the relative position encoding. This makes the method better suited to predicting actual service data, reduces the model's computation of relative positions, and improves inference speed in practical applications. Via complex multiplication, the relative head or tail position vector representation of an input vector may be calculated as

$$f(\mathbf{x}, m) = \mathbf{x}\, e^{\,\mathrm{i}\, m \theta}$$

where $m$ denotes the head position information (head, hereinafter denoted by $h$) or the tail position information (tail, hereinafter denoted by $t$).
Thus, between any two tokens there are four relative distances (head-to-head, head-to-tail, tail-to-head, and tail-to-tail) to represent the relationship between them.
In one example, θ may be calculated using the following formula:

$$\theta_i = 10000^{-2(i-1)/d}, \quad i \in \{1, 2, \ldots, d/2\}$$

where $d$ is the dimension of the input vector corresponding to the current word and $i$ indexes the dimension pairs of that vector.
Step 370-3, constructing the multidimensional vector representation of each input vector through complex rotary position encoding according to the relative head position information and relative tail position information of each input vector, wherein the multidimensional vector representation of each input vector relates to the dimensionality of the input vector.
For example, for a two-dimensional vector representation (i.e., a complex representation of the two positions), the representation generated from the relative head position information and the relative tail position information of the input vector may be as follows:

$$\big(e^{\,\mathrm{i}\, h \theta},\; e^{\,\mathrm{i}\, t \theta}\big)$$

where $h$ denotes the relative head position information and $t$ denotes the relative tail position information.
If the dimension of the current input vector is d, the rotary position encoding generalizes to a d-dimensional vector representation, as shown in FIG. 5.
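The d-dimensional rotary representation can be sketched in a few lines. Below is a minimal NumPy illustration, assuming an even dimension d and the θ definition above; pairing adjacent dimensions into complex numbers is one common convention, not mandated by the patent.

```python
# Rotary relative position encoding: view a d-dimensional vector as d/2
# complex numbers and rotate each by position m, so dot products between
# rotated vectors depend only on relative positions.
import numpy as np

def rotary_encode(x, m):
    """Rotate d-dimensional vector x by position m (head or tail); d even."""
    d = x.shape[0]
    i = np.arange(1, d // 2 + 1)
    theta = 10000.0 ** (-2 * (i - 1) / d)     # theta_i from the formula above
    z = x[0::2] + 1j * x[1::2]                # pair dimensions into complex numbers
    z_rot = z * np.exp(1j * m * theta)        # complex multiplication
    out = np.empty_like(x)
    out[0::2], out[1::2] = z_rot.real, z_rot.imag
    return out

q = rotary_encode(np.random.randn(64), m=3)   # encode with head position 3
```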
Because the Transformer model itself does not introduce extra prior information, the model contains redundant connections. Although the residual connections of the summed normalization residual layers can alleviate vanishing gradients, entities in the sensitive word recognition task are sparse, so directly stacking multiple layers of Transformer encoding components easily overfits both the business data set and some public data sets. This embodiment therefore introduces a residual connection based on the attention residual layer on top of the standard Transformer encoder, further strengthening entity boundary information so that the model can extract sensitive word information from text more accurately. In one embodiment, the attention residual may be determined using the following steps:
Step S1, the attention score matrix of the current Transformer encoding component is obtained.
In one embodiment, the input vectors may be assembled into an input vector matrix, which is multiplied by pre-trained query, key, and value weight matrices to obtain the corresponding query, key, and value vector matrices. The attention scores are then computed from these matrices with a multi-head attention mechanism, and the attention score matrix is output.
In another embodiment, the step S1 may further include the steps of:
and determining the attention score of each two input vectors according to the multidimensional vector representation of each input vector, and generating an attention score matrix.
In one implementation, the attention score of each pair of input vectors can be calculated with the following formula, fusing the relative position information into the attention mechanism:

$$A^*_{i,j} = \big(\mathbf{R}_i \odot (\mathbf{x}_i W_q)\big)\,\big(\mathbf{R}_j \odot (\mathbf{x}_j W_k)\big)^{\top}$$

where $\mathbf{R}_i$ is the d-dimensional vector representation of the current input vector; $\mathbf{R}_j$ is the d-dimensional vector representation of another input vector (the $j$-th); $W_q$ is the linear transformation applied to the current word (the query); $W_k$ is the linear transformation applied to the $j$-th word (the key); $\mathbf{x}_i$ is the current input vector; $\mathbf{x}_j$ is the $j$-th input vector; and $A^*_{i,j}$ is the attention score between the multidimensional vector representations of the current and $j$-th input vectors.
In practice, the encoding process is performed in a matrix form, so that when the attention score of the pairwise input vector is calculated, an attention score matrix can be finally generated.
Step S2, the attention score matrix of the Transformer encoding component one layer above the current Transformer encoding component is obtained.
Step S3, the target attention score matrix of the current Transformer encoding component is determined according to the attention score matrix of the current Transformer encoding component and the attention score matrix of the upper-layer Transformer encoding component.
For the sensitive word recognition task, the true sensitive words in a sentence are sparse relative to the whole sentence, so making the attention distribution as sharp as possible helps the model recognize them. In this embodiment, the attention score matrix computed at the previous layer is accumulated into the computation of the following layer in a manner similar to a residual connection. The resulting target attention score matrix has a sharper attention distribution, further strengthens the weight of sensitive word entities, sparsifies the attention distribution, and helps the model recognize the boundary information of sensitive words.
In one embodiment, step S3 may further include the steps of:
Taking the parameter value of a specified model parameter as the attention weight; and performing, according to the attention weight, a weighted sum of the attention score matrix of the current Transformer encoding component and the attention score matrix of the previous-layer Transformer encoding component to obtain the target attention score matrix of the current Transformer encoding component, wherein, if the current component is the first-layer Transformer encoding component, the attention score matrix of the previous layer is determined to be 0.
In one implementation, the parameter value of the specified model parameter may be a value less than 1, used to adjust the weight of the upper layer's attention scores at the current layer and increase the generalization ability of the model. Denoting the parameter value by β, the target attention score matrix may be calculated as:

target attention score matrix = (1 − β) × attention score matrix of the current Transformer encoding component + β × attention score matrix of the previous-layer Transformer encoding component
Step S4, a value vector matrix is determined based on an input vector matrix formed by the input vectors of the current Transformer encoding component.
When the method is implemented, the value vector matrix can be a matrix obtained by performing linear transformation processing on an input vector matrix formed by all input vectors.
Step S5, the attention residual is determined based on the value vector matrix and the target attention score matrix.
In one implementation, the target attention score matrix may be normalized by a normalization function and then multiplied by the value vector matrix; the resulting matrix is used as the attention residual.
The attention residual can be calculated using the following formula:

$$\mathrm{Residual\_Att}(A^*, V, \mathrm{prev}) = \mathrm{softmax}\big((1-\beta)\,A^* + \beta \cdot \mathrm{prev}\big)\,V$$

where prev is the attention score matrix of the previous-layer Transformer encoding component, β is the parameter value of the specified model parameter, A* is the attention score matrix of the current Transformer encoding component, V is the value vector matrix, and softmax is the normalization function.
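A NumPy sketch of the attention residual, assuming β is a scalar model parameter and that the blended (pre-softmax) scores are what each layer passes on as prev, which is one plausible reading of the accumulation described above:

```python
# Attention residual: blend the previous layer's attention score matrix
# into the current layer's scores before softmax, sharpening the
# attention distribution over sparse sensitive word entities.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def residual_attention(A_star, V, prev, beta=0.3):
    """Residual_Att(A*, V, prev) = softmax((1 - beta) * A* + beta * prev) V."""
    prev = np.zeros_like(A_star) if prev is None else prev  # first layer: prev = 0
    blended = (1 - beta) * A_star + beta * prev
    return softmax(blended) @ V, blended   # blended scores become next layer's prev

n, d = 5, 8
A = np.random.randn(n, n)
V = np.random.randn(n, d)
out, scores = residual_attention(A, V, prev=None)
print(out.shape)  # (5, 8)
```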
After context encoding by the Transformer encoder, the encoded hidden states must still be decoded by the CRF decoding layer into the tag sequence corresponding to the text, which contains the labeling result of each word. The advantage of using a CRF as the decoder is that it not only decodes the observation sequence into the corresponding labels but also models the dependencies between sequence labels.
Step 380, identifying sensitive words according to the labeling results of the words, and determining the category of each sensitive word from the labeling results.
In this step, the labeling results output by the sequence labeling model are parsed; the results representing sensitive words are identified, and the corresponding words are taken as sensitive words. The category of each sensitive word can then be extracted from its labeling result. For example, if the labeling result of a word is "B-SZ", the word is a sensitive word whose category is politics-related.
In this embodiment, existing external domain word libraries (the domain dictionary and the domain N-gram dictionary) are combined with a Transformer-based sequence labeling model, which effectively avoids the error propagation caused by a separate word segmentation model while letting the sequence labeling model exploit the sequence information between words. Meanwhile, since the Transformer encoder processes tokens in parallel, the running efficiency of the sequence labeling model is also greatly improved.
In addition, this embodiment connects multiple layers of Transformer encoding components through attention residuals to sparsify the distribution of attention scores, which effectively strengthens the model's ability to extract sensitive word boundaries and further improves its accuracy in recognizing sensitive words.
Compared with English sequence labeling tasks, Chinese lacks explicit word boundary information and tense information, so accurately and effectively determining entity boundary information is particularly important in practical applications. For this reason, when the attention residual is calculated, the multidimensional vector representation of each input vector is obtained through multidimensional complex relative position encoding, which merges the lexical information of the text into the sequence labeling model, strengthens entity boundary information, reduces computation, and yields word information and word boundary information without running a word segmentation model.
In particular, this embodiment can recover some illegal low-frequency sensitive words that are hard to discover by manual statistics, effectively improving the product's coverage in discovering new illegal low-frequency words online.
It should be noted that although the application applies the model to a sensitive word recognition service, it belongs to the technical field of sequence labeling: the principles of this embodiment can also be used in tasks such as named entity recognition, word segmentation, and new word discovery to strengthen the boundaries of the items to be recognized and facilitate their recognition and extraction by the model.
Example Three
Fig. 6 is a block diagram of an embodiment of a sensitive word recognition device according to a third embodiment of the present application, which may include the following modules:
a word set determining module 610, configured to determine a word set of a text to be identified based on a pre-generated domain dictionary database, where each word in the word set includes head position information and tail position information;
a word forming component obtaining module 620, configured to split word forming components of each word in the word set to obtain a word forming component corresponding to each word;
The input vector determining module 630 is configured to obtain a word vector corresponding to each word, and obtain a word component vector corresponding to a word component of each word; generating an input vector of each word based on the word vector of the word and the word forming component vector;
a sequence labeling module 640, configured to input the head position information, tail position information, and input vectors of each word in the word set into a pre-generated sequence labeling model, determine, by the sequence labeling model, a multidimensional vector representation of each input vector using a relative position encoding algorithm based on the head position information, tail position information, and input vectors, and determine the labeling result of each word based on the multidimensional vector representation, wherein the sequence labeling model comprises a Transformer encoder and a conditional random field decoder;
the sensitive word recognition module 650 is configured to recognize a sensitive word according to the labeling result of each word.
In one embodiment, the word set determining module 610 is specifically configured to:
performing word matching on the text to be recognized by adopting a matching algorithm in a pre-generated domain dictionary library to obtain a word set of the text to be recognized, wherein the domain dictionary library comprises a plurality of words commonly used in the current domain and word vectors corresponding to the words;
Acquiring head position information and tail position information of each word in the word set in the text to be identified;
The input vector determination module 630 is specifically configured to:
And obtaining word vectors corresponding to the matched words from the domain dictionary library.
In another embodiment, the input vector determination module 630 is specifically configured to:
searching for the N-gram word vectors corresponding to each word in a pre-generated domain N-gram dictionary, wherein the domain N-gram dictionary comprises a number of words commonly used in the current domain and the N-gram word vectors corresponding to each word.
In another embodiment, the input vector determination module 630 is specifically configured to:
And inputting each word forming component corresponding to each word into a pre-generated word forming component network model, and obtaining a word forming component vector of each word forming component output by the word forming component network model.
In one embodiment, the Transformer encoder comprises at least two Transformer encoding components, adjacent components being connected through attention residuals;
the apparatus further comprises an attention residual determination module comprising:
a current-layer attention score matrix acquisition sub-module, configured to acquire the attention score matrix of the current Transformer encoding component;
an upper-layer attention score matrix acquisition sub-module, configured to acquire the attention score matrix of the Transformer encoding component one layer above the current one;
a target attention score matrix determining sub-module, configured to determine the target attention score matrix of the current Transformer encoding component according to the attention score matrix of the current Transformer encoding component and the attention score matrix of the upper-layer Transformer encoding component;
a value vector matrix determining sub-module, configured to determine a value vector matrix based on an input vector matrix formed by the input vectors of the current Transformer encoding component;
an attention residual determination submodule for determining an attention residual based on the value vector matrix and the target attention score matrix.
In one embodiment, the target attention score matrix determination submodule is specifically configured to:
Taking the parameter value of the appointed model parameter as the attention weight;
and, according to the attention weight, performing a weighted sum of the attention score matrix of the current Transformer encoding component and the attention score matrix of the previous-layer Transformer encoding component to obtain the target attention score matrix of the current Transformer encoding component, wherein, if the current component is the first-layer Transformer encoding component, the attention score matrix of the previous layer is determined to be 0.
In one embodiment, the current layer attention score matrix acquisition sub-module is specifically configured to:
and determining the attention score of each two input vectors according to the multidimensional vector representation of each input vector, and generating an attention score matrix.
In one embodiment, the sequence labeling module 640 is specifically configured to:
Determining relative head position vector representation of the input vector corresponding to the word according to the position of the word in the text to be recognized, the head position information and the dimension of the input vector corresponding to the word;
Determining relative tail position vector representation of the input vector corresponding to the word according to the position of the word in the text to be recognized, the tail position information and the dimension of the input vector corresponding to the word;
and constructing the multidimensional vector representation of each input vector through complex rotary position encoding according to the relative head position information and relative tail position information of each input vector, wherein the multidimensional vector representation of each input vector relates to the dimensionality of the input vector.
In one embodiment, the sequence annotation model is trained as follows:
labeling the collected sample data set by adopting a bidirectional maximum matching algorithm based on a pre-generated domain dictionary library;
dividing the marked sample data set into a training data set, a verification data set and a test data set according to a set proportion;
performing data enhancement on the training data set using a preset data enhancement strategy based on named entity recognition;
Training a sequence annotation model by adopting a training data set after data enhancement;
And respectively adopting the verification data set and the test data set to verify and test the trained sequence annotation model.
In one embodiment, the data enhancement policy includes one or a combination of the following:
randomly deleting a character within an entity;
replacing an entity with a homophone;
replacing an entity with a synonym;
replacing an entity with another entity bearing the same label.
In one embodiment, the apparatus further comprises:
and the sensitive word class identification module is used for determining the class of the sensitive word from the labeling result.
The sensitive word recognition device provided by the embodiment of the application can execute the sensitive word recognition method in the first embodiment or the second embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method.
Example Four
Fig. 7 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application, and as shown in fig. 7, the electronic device includes a processor 710, a memory 720, an input device 730, and an output device 740; the number of processors 710 in the electronic device may be one or more, one processor 710 being taken as an example in fig. 7; the processor 710, memory 720, input device 730, and output device 740 in the electronic device may be connected by a bus or other means, for example in fig. 7.
The memory 720 is a computer readable storage medium, and may be used to store a software program, a computer executable program, and modules, such as program instructions/modules corresponding to the first or second embodiments of the present application. The processor 710 executes various functional applications of the electronic device and data processing by executing software programs, instructions and modules stored in the memory 720, i.e., implements the methods mentioned in the first or second method embodiments.
The memory 720 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and at least one application program required for functionality, and the data storage area may store data created according to the use of the terminal, etc. In addition, the memory 720 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 720 may further include memory located remotely from the processor 710, connected to the device/terminal/server via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 730 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device. The output device 740 may include a display device such as a display screen.
Example Five
The fifth embodiment of the present application further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, perform the method of the first or second method embodiment described above.
Of course, the storage medium containing computer-executable instructions provided in the embodiments of the present application is not limited to the method operations described above, and may also perform related operations in the method provided by any embodiment of the present application.
Example Six
The sixth embodiment of the present application further provides a computer program product comprising computer-executable instructions which, when executed by a computer processor, perform the method of the first or second method embodiment described above.
Of course, the computer-executable instructions of the computer program product provided by the embodiments of the present application are not limited to the method operations described above, and may also perform related operations in the method provided by any embodiment of the present application.
From the above description of the embodiments, it will be clear to those skilled in the art that the present application may be implemented by software together with necessary general-purpose hardware, or by hardware alone, although in many cases the former is preferred. Based on this understanding, the technical solution of the present application, or the part of it that contributes over the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, such as a floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk, or optical disk, and including several instructions for causing an electronic device (which may be a personal computer, a server, a network device, etc.) to execute the method of the embodiments of the present application.
It should be noted that, in the apparatus embodiment above, the units and modules are divided only according to functional logic and are not limited to that division, as long as the corresponding functions can be implemented; likewise, the specific names of the functional units serve only to distinguish them from one another and do not limit the protection scope of the present application.
Note that the above are only preferred embodiments of the present application and the technical principles applied. Those skilled in the art will understand that the present application is not limited to the particular embodiments described herein, and that various obvious changes, rearrangements, and substitutions can be made without departing from the scope of the application. Therefore, while the application has been described in connection with the above embodiments, it is not limited to them, and may be embodied in many other equivalent forms without departing from its spirit or scope, which is set forth in the following claims.

Claims (15)

1. A method for identifying a sensitive word, the method comprising:
determining a word set of a text to be recognized based on a pre-generated domain dictionary library, wherein each word in the word set carries head position information and tail position information;
splitting each word in the word set into its word-forming components;
acquiring the word vector corresponding to each word and the word-forming component vectors corresponding to the word-forming components of each word, and generating an input vector for each word based on its word vector and word-forming component vectors;
inputting the head position information, the tail position information, and the input vector of each word in the word set into a pre-generated sequence labeling model, the sequence labeling model determining a multi-dimensional vector representation of each input vector from the head position information, the tail position information, and the input vectors by means of a relative position encoding algorithm, and determining a labeling result for each word based on the multi-dimensional vector representations, wherein the sequence labeling model comprises a Transformer encoder and a conditional random field decoder;
and identifying sensitive words according to the labeling results of the words.
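For concreteness, the step of claim 1 that generates an input vector from a word vector and word-forming component vectors could look like the Python sketch below. The mean-pooling of component vectors, their concatenation with the word vector, and the assumption that component vectors share the word vector's dimensionality are all illustrative choices; the claim does not fix the combination operator.

```python
import numpy as np

def build_input_vector(word_vec: np.ndarray, component_vecs: list) -> np.ndarray:
    """One plausible combination: mean-pool the vectors of the word's
    word-forming components and concatenate the result with the word vector."""
    if component_vecs:
        comp = np.mean(component_vecs, axis=0)
    else:
        comp = np.zeros_like(word_vec)   # a word that yields no split components
    return np.concatenate([word_vec, comp])
```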
2. The method of claim 1, wherein determining the word set of the text to be recognized based on the pre-generated domain dictionary library comprises:
performing word matching on the text to be recognized using a matching algorithm over the pre-generated domain dictionary library to obtain the word set of the text to be recognized, wherein the domain dictionary library comprises a plurality of words commonly used in the current domain and the word vectors corresponding to those words;
and acquiring the head position information and the tail position information of each word in the word set within the text to be recognized;
wherein acquiring the word vector corresponding to each word comprises:
obtaining the word vector of each matched word from the domain dictionary library.
3. The method according to claim 1 or 2, wherein acquiring the word vector corresponding to each word comprises:
searching a pre-generated domain N-gram dictionary library for the N-gram word vector corresponding to each word, wherein the domain N-gram dictionary library comprises a plurality of words commonly used in the current domain and the N-gram word vector corresponding to each word.
4. The method according to claim 1 or 2, wherein acquiring the word-forming component vectors corresponding to the word-forming components of each word comprises:
inputting each word-forming component of each word into a pre-generated word-forming component network model, and obtaining the word-forming component vector that the network model outputs for each component.
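Claim 4 leaves the architecture of the word-forming component network model unspecified; as a minimal stand-in, it could be as simple as a trainable embedding table over component (e.g., radical) ids, sketched below in PyTorch. The class name and vocabulary handling are assumptions.

```python
import torch
import torch.nn as nn

class ComponentEncoder(nn.Module):
    """Maps word-forming component ids to component vectors."""
    def __init__(self, n_components: int, dim: int):
        super().__init__()
        self.emb = nn.Embedding(n_components, dim)

    def forward(self, component_ids: torch.LongTensor) -> torch.Tensor:
        return self.emb(component_ids)   # one vector per word-forming component

# usage: vectors for the three components of one word
encoder = ComponentEncoder(n_components=500, dim=50)
vecs = encoder(torch.tensor([3, 17, 42]))
```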
5. The method of claim 1, wherein the Transformer encoder comprises at least two Transformer encoding components connected to one another by attention residuals;
the attention residual is determined as follows:
acquiring an attention score matrix of the current Transformer encoding component;
acquiring the attention score matrix of the Transformer encoding component one layer before the current Transformer encoding component;
determining a target attention score matrix of the current Transformer encoding component according to the attention score matrix of the current Transformer encoding component and the attention score matrix of the previous-layer Transformer encoding component;
determining a value vector matrix based on an input vector matrix formed from the input vectors of the current Transformer encoding component;
and determining the attention residual based on the value vector matrix and the target attention score matrix.
6. The method of claim 5, wherein determining the target attention score matrix of the current Transformer encoding component according to the attention score matrix of the current Transformer encoding component and the attention score matrix of the previous-layer Transformer encoding component comprises:
taking the value of a specified model parameter as an attention weight;
and performing, according to the attention weight, a weighted summation of the attention score matrix of the current Transformer encoding component and the attention score matrix of the previous-layer Transformer encoding component to obtain the target attention score matrix of the current Transformer encoding component, wherein if the current Transformer encoding component is the first-layer Transformer encoding component, the attention score matrix of the previous layer is taken to be 0.
7. The method of claim 5 or 6, wherein acquiring the attention score matrix of the current Transformer encoding component comprises:
determining the attention score between every two input vectors according to the multi-dimensional vector representations of the input vectors, and generating the attention score matrix.
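Read together, claims 5-7 describe blending each layer's attention score matrix with the previous layer's before applying it to the value matrix. The numpy sketch below is one reading of that computation; the softmax placement, the exact form of the weighted sum, and the single scalar weight `alpha` standing in for the specified model parameter are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_residual(X, Wq, Wk, Wv, prev_scores=None, alpha=0.5):
    """Blend the current component's attention scores with the previous
    layer's (taken as 0 at the first layer, per claim 6) and apply the
    blended target scores to the value matrix, per claim 5."""
    d_k = Wq.shape[1]
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                       # value matrix from the inputs
    scores = softmax(Q @ K.T / np.sqrt(d_k))               # current attention score matrix
    if prev_scores is None:
        prev_scores = np.zeros_like(scores)                # first layer: previous scores are 0
    target = alpha * scores + (1.0 - alpha) * prev_scores  # weighted summation of the two
    return target @ V, target                              # residual output and scores to pass up

# usage across two layers (square projections assumed so the layers stack):
# out1, s1 = attention_residual(X, Wq1, Wk1, Wv1)
# out2, s2 = attention_residual(out1, Wq2, Wk2, Wv2, prev_scores=s1)
```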
8. The method of claim 1, wherein determining a multi-dimensional vector representation of each input vector using a relative position encoding algorithm based on the head position information, the tail position information, and the input vectors comprises:
determining a relative head-position vector representation of the word's input vector according to the position of the word in the text to be recognized, the head position information, and the dimensionality of the input vector;
determining a relative tail-position vector representation of the word's input vector according to the position of the word in the text to be recognized, the tail position information, and the dimensionality of the input vector;
and constructing the multi-dimensional vector representation of each input vector from its relative head position information and relative tail position information by means of complex rotary position encoding, wherein the multi-dimensional vector representation of each input vector depends on the number of dimensions of that input vector.
9. The method according to claim 1, 2 or 5, wherein the sequence labeling model is trained by:
labeling a collected sample data set with a bidirectional maximum matching algorithm based on the pre-generated domain dictionary library;
dividing the labeled sample data set into a training data set, a validation data set, and a test data set according to a set ratio;
performing data augmentation on the training data set with a preset augmentation strategy based on named entity recognition;
training the sequence labeling model on the augmented training data set;
and validating and testing the trained sequence labeling model with the validation data set and the test data set, respectively.
10. The method of claim 9, wherein the data augmentation strategy comprises one or a combination of the following:
randomly deleting a character within the entity;
replacing the entity with a homophone;
replacing the entity with a synonym;
and replacing the entity with another entity bearing the same label.
11. The method according to claim 1, wherein after the sensitive words are identified according to the labeling results of the words, the method further comprises:
determining the category of each sensitive word from the labeling result.
12. A sensitive word recognition apparatus, the apparatus comprising:
a word set determining module, configured to determine a word set of a text to be recognized based on a pre-generated domain dictionary library, wherein each word in the word set carries head position information and tail position information;
a word-forming component acquisition module, configured to split each word in the word set into its word-forming components;
an input vector determining module, configured to acquire the word vector corresponding to each word and the word-forming component vectors corresponding to the word-forming components of each word, and to generate an input vector for each word based on its word vector and word-forming component vectors;
a sequence labeling module, configured to input the head position information, the tail position information, and the input vector of each word in the word set into a pre-generated sequence labeling model, the sequence labeling model determining a multi-dimensional vector representation of each input vector from the head position information, the tail position information, and the input vectors by means of a relative position encoding algorithm, and determining a labeling result for each word based on the multi-dimensional vector representations, wherein the sequence labeling model comprises a Transformer encoder and a conditional random field decoder;
and a sensitive word recognition module, configured to identify sensitive words according to the labeling results of the words.
13. An electronic device, the electronic device comprising:
one or more processors;
storage means for storing one or more programs,
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-11.
14. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any one of claims 1-11.
15. A computer program product comprising computer executable instructions for implementing the method of any one of claims 1-11 when executed.
CN202210064884.1A 2022-01-20 2022-01-20 Sensitive word recognition method, device, equipment, storage medium and program product Active CN114416925B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210064884.1A CN114416925B (en) 2022-01-20 2022-01-20 Sensitive word recognition method, device, equipment, storage medium and program product

Publications (2)

Publication Number Publication Date
CN114416925A (en) 2022-04-29
CN114416925B (en) 2024-07-02

Family

ID=81275588

Country Status (1)

Country Link
CN (1) CN114416925B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102360436A (en) * 2011-10-24 2012-02-22 中国科学院软件研究所 Identification method for on-line handwritten Tibetan characters based on components
CN112464663A (en) * 2020-12-01 2021-03-09 小牛思拓(北京)科技有限公司 Multi-feature fusion Chinese word segmentation method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4308676B2 (en) * 2003-01-24 2009-08-05 Ricoh Co., Ltd. Character string processing apparatus, character string processing method, and image forming apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant