CN114169330A - Chinese named entity identification method fusing time sequence convolution and Transformer encoder - Google Patents

Chinese named entity identification method fusing time sequence convolution and Transformer encoder

Info

Publication number
CN114169330A
Authority
CN
China
Prior art keywords
model
text
layer
convolution
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111399845.9A
Other languages
Chinese (zh)
Other versions
CN114169330B (en)
Inventor
孙俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunentropy Education Technology Wuxi Co ltd
Original Assignee
Yunentropy Education Technology Wuxi Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunentropy Education Technology Wuxi Co ltd filed Critical Yunentropy Education Technology Wuxi Co ltd
Priority to CN202111399845.9A priority Critical patent/CN114169330B/en
Publication of CN114169330A publication Critical patent/CN114169330A/en
Application granted granted Critical
Publication of CN114169330B publication Critical patent/CN114169330B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

A Chinese named entity recognition method integrating temporal convolution and a Transformer encoder, belonging to the field of character recognition. First, character and word features are modeled using the flat lattice structure proposed in prior work, and the absolute position encoding in the Transformer encoder is replaced with relative position encoding to avoid the loss of directional information. Second, a TCN is used to strengthen the network model's capture of position information and to acquire more local contextual semantic relations. Finally, an R-Drop strategy is adopted to apply a regularization constraint on the output distribution of the model, preventing overfitting and improving the generalization capability of the model. Experimental results show that the F1 values of the model on the Weibo and MSRA datasets reach 61.18% and 94.48% respectively, outperforming the traditional model and the baseline model and verifying the superiority of the model for Chinese named entity recognition.

Description

Chinese named entity identification method fusing time sequence convolution and Transformer encoder
Technical Field
The invention belongs to the technical field of named entity recognition and provides a Chinese named entity recognition method that fuses temporal convolution and a Transformer encoder.
Background
Named entity recognition, as a basic task underlying many natural language processing applications, plays an important role in complex fields such as event extraction, machine question answering, information retrieval and knowledge graph construction, and has received increasing attention in recent years. Common categories of named entities include person names, place names, organization names, times, values, currencies and some proper nouns. Early traditional named entity recognition algorithms were dictionary- and rule-based methods that used machine learning to jointly predict entity boundaries and class labels, but such approaches are often tied to a particular domain and are inefficient and inflexible. With the improvement of computer performance and the development of deep learning, methods based on deep learning have been widely applied in natural language processing and have gradually become mainstream in named entity recognition.
A named entity recognition model based on deep learning automatically extracts feature information from text using pre-trained word vectors, without manual feature engineering, thereby improving feature expression and data fitting. The Recurrent Neural Network (RNN) is widely applied to named entity tasks because of its advantage in processing sequential, time-stream data; in particular, the Bidirectional Long Short-Term Memory network (BiLSTM) has bidirectional characterization capability and can model text through the context of the current word. The combination of BiLSTM and CRF (Conditional Random Field) proposed by Huang et al. [HUANG Z, XU W, YU K, et al. Bidirectional LSTM-CRF models for sequence tagging [EB/OL]. [2015-08-09]. https://arxiv.org/abs/1508.01991.pdf] is currently the most popular model. The Lattice-LSTM model proposed by Zhang et al. [YUE ZHANG, JIE YANG. Chinese NER using Lattice LSTM [C] // In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2018, Vol(1): 1554-1564] further improved on the BiLSTM model: it first used a lattice structure to combine character-sequence labeling with word-sequence labeling and achieved a significant effect. However, the loop structure of the RNN cannot capture long-term dependencies well and cannot be computed in parallel, which limits computational efficiency, so many learning models have abandoned the loop structure and turned to the parallelizable Convolutional Neural Network (CNN) or to attention mechanisms. Two representative models of this kind are the Temporal Convolutional Network (TCN) and the Transformer model proposed by Vaswani et al., which perform on various sequence learning tasks as well as or even better than RNN models.
The TCN model is a variant of the CNN. Compared with the Transformer model, its ability to acquire context information from text of arbitrary length is weaker and its structure is unidirectional, so it is used less often in named entity recognition tasks. However, the TCN has stable gradients, a low memory footprint, and a receptive field that can be customized flexibly for different tasks. The Transformer model uses an attention mechanism to construct an encoder-decoder framework; in the field of named entity recognition, typically only the Transformer's encoder is used, with a CRF model as the decoder. However, the Transformer encoder performs only moderately on named entity recognition, mainly because a pure self-attention mechanism cannot distinguish position and direction information, which is very important for the Chinese named entity recognition task. The Transformer encoder therefore incorporates position information into the input as an absolute position code, but this cannot distinguish the relative order of words at different positions, i.e. there is no direction information. Yan et al. [YAN H, DENG B, LI X N, et al. TENER: adapting Transformer encoder for named entity recognition [EB/OL]. [2019-12-10]. https://arxiv.org/abs/1911.04474.pdf] proposed the TENER model in 2019; by adding direction and distance awareness to the attention mechanism and using an unscaled multiplication, the absolute position coding of the original Transformer encoder is changed into relative position coding, making the Transformer encoder effective for Chinese named entities. Li et al. [XIAONAN LI, HANG YAN, XIPENG QIU, et al. FLAT: Chinese NER using flat-lattice Transformer [C] // In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: Association for Computational Linguistics, 2020: 6836-6842] proposed the FLAT model (Flat-Lattice Transformer), which builds on the TENER and Lattice-LSTM models by using relative position coding and introducing word information and lattice character information through a flat, lossless structure, and has achieved great success in Chinese named entity recognition. With relative position coding introduced, the Transformer encoder can capture local information between words in addition to acquiring global features of a sentence through the self-attention mechanism. But this relies heavily on position coding, whose influence is limited and which requires careful manual design, without changing the inherent drawbacks of the Transformer model structure.
The basis of deep learning models is the deep neural network. When training large-scale deep neural network models, regularization techniques such as layer normalization, batch normalization and Dropout are indispensable for preventing overfitting and improving generalization. Among them, Dropout is a widely used regularization technique because it only needs to randomly discard a portion of the neurons during training. However, randomly discarding some neurons each time means that a different submodel is generated after each discard, so to a certain extent Dropout makes the trained model a combined constraint of multiple submodels. Building on the randomness that this special mode of Dropout brings to the network, the R-Drop proposed by Liang et al. [LIANG X, WU L, LI J, et al. R-Drop: Regularized Dropout for neural networks [EB/OL]. [2021-06-28]. https://arxiv.org/abs/2106.14448.pdf] acts on the output layer of the model and further applies a regularization constraint on the output predictions of the multiple submodels, so that the submodel outputs stay consistent.
In summary, models based on the Transformer encoder need to acquire the direction information of the sequence by introducing relative position coding, but they lack the network structure needed to model local features within a sentence sequence. Meanwhile, as a deep learning model, the Transformer encoder has a large number of parameters and is prone to overfitting during training, so the model cannot perform well on general datasets.
Disclosure of Invention
In view of the above problems, the present invention is directed to a Chinese named entity recognition method that fuses temporal convolution and a Transformer encoder.
The technical scheme of the invention is as follows:
a Chinese named entity recognition method fusing a time sequence convolution and a Transformer encoder comprises the following steps:
step one, establishing a Transformer-TCN-R-Drop model
The Transformer-TCN-R-Drop model consists of an input layer, an encoding layer and an output layer. The input layer comprises an embedding layer and position coding; the embedding layer adopts a flat lattice structure and, when generating character vectors, simultaneously generates the word vectors corresponding to the characters by consulting a dictionary, while the position coding applies relative position coding to the different character or word text segments. The encoding layer obtains local and global features of the text through the Transformer encoder and the TCN model and fuses the feature information captured by the two models with an ADD operation, finally obtaining a new vector sequence. The output layer decodes the fused feature vectors with a CRF model to obtain a globally optimal label sequence; meanwhile, an R-Drop regularization strategy is adopted throughout training to improve the generalization capability of the model.
The Transformer encoder is formed by stacking several encoders; the structure of each encoder layer comprises a multi-head self-attention layer, a feedforward network layer, residual connections and layer normalization.
The TCN model consists of a causal convolution suited to sequence structures, together with a dilated convolution and a residual module used to memorize historical information.
Step two, utilizing the established Transformer-TCN-R-Drop model to identify the Chinese named entity
(1) The potential words corresponding to each character of a sentence are obtained from a lexicon and added to the text sequence;
(2) the final text sequence is converted into a text sequence X = {x_1, ..., x_T} with the flat lattice structure adopted by the FLAT model, where T is the length of the text sequence after the words are included;
(3) a text sequence is defined as a collection of text segments span, wherein one text segment consists of a text token, a head and a tail. The text represents a character or a word, and the head and the tail represent the position index in the original text sequence of the first character and the last character of the text, respectively, wherein the head and the tail are the same for a single character.
(4) The embedding layer vectorizes each text segment into a matrix E_X ∈ R^{T×d_model}, where d_model is the dimension of the text vector.
(5) Encoding the interaction between the text segments defined in step (3) using relative position encoding.
(6) The text vector matrix from step (4) is processed by the multi-head attention mechanism in the Transformer encoder, and the position coding obtained in step (5) is added during the calculation.
(7) The output text vector matrix of the multi-head attention calculation is added to its input to form a residual connection; layer normalization is then applied.
(8) The normalized result of step (7) is fed into the feedforward network layer, where a nonlinear transformation is applied using ReLU activation.
(9) Similarly, the output after the ReLU activation in step (8) is residually connected with the normalized output of step (7), layer normalization is applied again, and the text feature A of the Transformer encoder is output.
(10) Local feature information between text vector matrices is obtained using causal convolution and dilated convolution of the TCN.
(11) The TCN regularizes the text vector matrix with residual modules; each residual module comprises two layers of convolution and nonlinear mapping, and the resulting text feature B is output.
(12) Using the ADD operation, the two features A and B are fused.
(13) Dropout is added in each layer of the Transformer encoder and the TCN network to regularize the network, and an R-Drop regularization strategy is adopted to avoid inconsistent output distributions of the training model caused by Dropout.
(14) The label sequence is output using a Softmax function and the CRF layer to obtain the final recognition result.
The invention has the beneficial effects that:
1. Without sacrificing the parallelism of the model or its ability to capture long-distance dependencies, a TCN and a Transformer encoder are fused at the encoding layer. The TCN's strengths are used to learn implicit position information and to capture more local information between characters and words, so that it complements the global and partial local information captured through relative position coding, and the finally extracted vocabulary features are more complete.
2. Under the condition of not influencing network neurons or model parameters, after the CRF model of the output layer outputs a prediction result, the R-Drop regularization technology is used for reducing the parameter freedom of the model and improving the robustness of the model. Experimental results prove that on the two types of general data sets, the accuracy, the recall rate and the F1 value of the model provided by the invention are improved under the condition of not depending on external resources and pre-training language models.
Drawings
FIG. 1 is a Transformer-TCN-R-Drop model framework.
Fig. 2 is a text sequence of a flat lattice structure.
FIG. 3 is the Transformer encoder architecture.
FIG. 4 shows the TCN model structure.
FIG. 5 shows the R-Drop process.
Detailed Description
1. Transformer-TCN-R-Drop model
The overall model framework proposed by the invention is shown in fig. 1 and mainly comprises an input layer, an encoding layer and an output layer.
The input layer comprises an embedding layer and position coding; the embedding layer adopts a flat lattice structure and, when generating character vectors, simultaneously generates the word vectors corresponding to the characters by consulting a dictionary, while the position coding applies relative position coding to the different character or word text segments. The encoding layer obtains local and global features of the text through the Transformer encoder and the TCN model and fuses the feature information captured by the two models with an ADD operation, finally obtaining a new vector sequence. The output layer decodes the fused feature vectors with a CRF model to obtain a globally optimal label sequence; meanwhile, an R-Drop regularization strategy is adopted throughout training to improve the generalization capability of the model.
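As an illustration of the data flow described above, a minimal PyTorch-style sketch of the forward pass is given below; the submodule names and signatures (embed, transformer_encoder, tcn, and the downstream CRF decoder) are placeholders assumed for this sketch rather than an actual released implementation.

import torch.nn as nn

class TransformerTCNRDrop(nn.Module):
    """Data-flow sketch of the Transformer-TCN-R-Drop model described above."""

    def __init__(self, embed, transformer_encoder, tcn, d_model, num_labels):
        super().__init__()
        self.embed = embed                              # flat-lattice embedding (characters + matched words)
        self.transformer_encoder = transformer_encoder  # Transformer encoder with relative position coding
        self.tcn = tcn                                  # temporal convolutional network
        self.emission = nn.Linear(d_model, num_labels)  # per-token label scores fed to the CRF layer

    def forward(self, chars, words, heads, tails):
        e_x, rel_pos = self.embed(chars, words, heads, tails)  # (B, T, d_model) and relative positions
        z = self.transformer_encoder(e_x, rel_pos)             # global features (feature A)
        b = self.tcn(e_x)                                      # local features (feature B)
        h = z + b                                              # ADD fusion
        return self.emission(h)                                # decoded by a CRF layer downstream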
1.1 input layer
1.1.1 Embedded layers
Before the text is fed into the Transformer-TCN-R-Drop model, it needs a vectorized representation, i.e. each token in the text is mapped to a vector of fixed dimension; this is the function of the embedding layer. In the named entity recognition task, dictionary information can well avoid word-segmentation errors. For example, the words "重庆" (Chongqing) and "人和药店" (Renhe Pharmacy) in "重庆人和药店" eliminate the potentially erroneous entity "重庆人" (Chongqing person). Therefore, before the text is vectorized, all potential words corresponding to each character in the text are obtained from a lexicon, and the final text sequence is converted into a text sequence X = {x_1, ..., x_T} with the flat lattice structure adopted by the FLAT model, where T is the length of the text sequence after the words are included, as shown in fig. 2. The text sequence X is defined as a set of text segments (spans), where one text segment consists of a token, a head and a tail; the token represents a character or a word, and the head and tail are the position indices of the token's first and last characters in the original character sequence (for a single character, head and tail are the same). Finally, each text segment is vectorized into a matrix E_X ∈ R^{T×d_model}, where d_model is the dimension of the text vector.
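A minimal sketch of how such a flat-lattice sequence of spans could be constructed from a sentence and a word lexicon is given below; the brute-force substring lookup is an illustrative assumption, not the exact matching procedure of the invention.

def build_flat_lattice(sentence, lexicon):
    """Return a list of (token, head, tail) spans: every character plus every
    lexicon word found in the sentence. head/tail are character indices; for a
    single character head == tail."""
    spans = [(ch, i, i) for i, ch in enumerate(sentence)]
    for start in range(len(sentence)):
        for end in range(start + 2, len(sentence) + 1):   # candidate words of length >= 2
            word = sentence[start:end]
            if word in lexicon:
                spans.append((word, start, end - 1))
    return spans

# Example with the sentence from the text and a small illustrative lexicon
lexicon = {"重庆", "人和药店", "药店", "重庆人"}
for token, head, tail in build_flat_lattice("重庆人和药店", lexicon):
    print(token, head, tail)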
1.1.2 position coding
To prevent the subsequent processing from losing the position information of the text, before the character-word vectors are input to the encoder the Transformer encoder adds an additional position code to represent the absolute position of each token in the text sequence. However, the input text sequence of the invention is flat and consists of text segments of different lengths. Following the FLAT model, relative position coding is therefore introduced to encode the interaction between text segments. For two text segments x_i and x_j in a text sequence X there are three possible relations: intersecting, containing and separated. To better represent the relation between different text segments and the distance between characters and words, dense vectors are used to model their relation:

d_ij^(hh) = head[i] - head[j]   (1)
d_ij^(ht) = head[i] - tail[j]   (2)
d_ij^(th) = tail[i] - head[j]   (3)
d_ij^(tt) = tail[i] - tail[j]   (4)

where head[i] and tail[i] denote the head and tail positions of x_i respectively, d_ij^(hh) denotes the distance between the head position of x_i and the head position of x_j, and the other three distances have similar meanings. For example, for the character "重" ("Chong") and the word "重庆" (Chongqing), the four relative distances show that the character "重" lies inside the word "重庆", i.e. an inclusion relation, so that in the encoder the character "重" pays more attention to the word "重庆" and the entity boundary can be identified better. The final relative position coding of the text segments is a simple nonlinear transformation of the four concatenated distances:
R_ij = ReLU(W_r (P_{d_ij^(hh)} ⊕ P_{d_ij^(ht)} ⊕ P_{d_ij^(th)} ⊕ P_{d_ij^(tt)}))   (5)

where W_r is a learnable parameter, ⊕ denotes the concatenation operation, and P_d is the sinusoidal position encoding used by Vaswani et al. and the TENER model [YAN H, DENG B, LI X N, et al. TENER: adapting Transformer encoder for named entity recognition [EB/OL]. [2019-12-10]. https://arxiv.org/abs/1911.04474.pdf]:

P_d^(2k) = sin(d / 10000^(2k/d_model))   (6)
P_d^(2k+1) = cos(d / 10000^(2k/d_model))   (7)

where d denotes any of d_ij^(hh), d_ij^(ht), d_ij^(th), d_ij^(tt), and k is the dimension index of the character-word vector, with value range [0, d_model/2].
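For illustration, a minimal sketch of equations (1)-(7) in PyTorch follows; the tensor layout, the assumption of an even d_model, and the form of the learnable transformation w_r (a linear map over the concatenated encodings, e.g. torch.nn.Linear(4 * d_model, d_model)) are assumptions made for this sketch.

import torch

def sinusoidal_pd(d, d_model):
    # P_d of equations (6)-(7) for a (possibly negative) relative distance d; d_model assumed even
    k = torch.arange(d_model // 2, dtype=torch.float)
    angle = d / torch.pow(10000.0, 2 * k / d_model)
    pe = torch.zeros(d_model)
    pe[0::2] = torch.sin(angle)
    pe[1::2] = torch.cos(angle)
    return pe

def relative_position_encoding(heads, tails, d_model, w_r):
    # R_ij of equation (5) for every pair of spans; heads/tails are lists of span indices
    T = len(heads)
    rows = []
    for i in range(T):
        row = []
        for j in range(T):
            dists = (heads[i] - heads[j], heads[i] - tails[j],
                     tails[i] - heads[j], tails[i] - tails[j])      # eq. (1)-(4)
            concat = torch.cat([sinusoidal_pd(d, d_model) for d in dists])
            row.append(torch.relu(w_r(concat)))                     # eq. (5)
        rows.append(torch.stack(row))
    return torch.stack(rows)                                        # shape (T, T, d_model)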
1.2 coding layer
1.2.1 Transformer encoder
The main purpose of the Transformer encoder is to compute the relations between the characters and words in the text, so that the model can learn the relations between tokens and the importance of each character and word, thereby acquiring global and local feature information. The Transformer encoder is formed by stacking several encoders; each encoder layer consists mainly of a multi-head self-attention layer, a feed-forward network layer, residual connections and layer normalization, as shown in fig. 3, and its core is the multi-head attention mechanism.
In the original Transformer encoder, the position information is added to the character-word vectors and the sum is fed directly into the self-attention layer. Because the position information used here is relative position coding, adding the same position code to text segments at relative positions would make their positional relation indistinguishable; the invention therefore abandons absolute position coding and feeds the relative position coding and the text vectors separately into the self-attention calculation of the encoder. In other words, when the vector at the current position is computed, the relative positional relations of the text segments it depends on are taken into account. Concretely, the input vector matrix E_X is multiplied by three different weight matrices W_q, W_k, W_v to obtain three vectors of the same dimension, namely the Query vector (Q), the Key vector (K) and the Value vector (V):

[Q, K, V] = E_X [W_q, W_k, W_v]   (8)
When the attention score is computed, only the relative positional relation between the Query vector and the Key vector is considered; this relative position information is added to the self-attention calculation of every layer of the Transformer encoder. The attention scores are then normalized with a Softmax function, multiplied by the Value vector, and finally the weighted sum over all text vectors is output. Each character vector thus contains not only information about the other characters but also word information, position information and distance information:

A_ij = (Q_i + u)^T K_j + (Q_i + v)^T R_ij W_R   (9)
Att(A, V) = softmax(A) V   (10)

where u, v and W_R are learnable parameters. The Transformer encoder performs the attention calculation on the text sequence with several attention heads that do not share weight matrices and then concatenates the results of the heads:

Multihead(A) = [head_1, head_2, ..., head_n] W   (11)
head_i = Att(A_i, V_i)   (12)

where i indexes the attention heads, i ∈ [1, n], and A_i and V_i are computed with the i-th head's own projection matrices. The output is then processed by the feedforward network layer FFN, a position-wise multi-layer perceptron with a nonlinear transformation that increases the nonlinear expression capability of the model. Meanwhile, to alleviate the degradation problem of deep network training, residual connections and layer normalization are added after the multi-head self-attention layer and the feedforward network layer, finally yielding a new sentence matrix Z = [z_1, z_2, ..., z_T], Z ∈ R^{T×d_model}.
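A simplified single-head version of the attention of equations (8)-(10), assuming R has been computed as in the sketch above and that all weights are plain tensors, could look like the following; it is an illustrative sketch, not the exact implementation.

import torch

def relative_attention(E_x, R, W_q, W_k, W_v, W_R, u, v):
    """Single-head attention with relative position coding, eq. (8)-(10).
    E_x: (T, d), R: (T, T, d), W_q/W_k/W_v/W_R: (d, d), u and v: (d,)."""
    Q, K, V = E_x @ W_q, E_x @ W_k, E_x @ W_v            # eq. (8)
    content = (Q + u) @ K.T                              # (Q_i + u)^T K_j, shape (T, T)
    pos = torch.einsum('id,ijd->ij', Q + v, R @ W_R)     # (Q_i + v)^T R_ij W_R, shape (T, T)
    A = content + pos                                    # eq. (9)
    return torch.softmax(A, dim=-1) @ V                  # eq. (10), weighted sum of Value vectors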
1.2.2 TCN model
Introducing relative position coding helps the Transformer encoder extract local feature information from the text sequence, but the local information obtained from position coding is limited and depends heavily on external dictionary information, so local information about the vocabulary cannot be learned through the model structure itself. The convolution-based TCN model preserves the relative positions between tokens through its convolution kernels, needs no additional hand-designed position coding, and retains the ability to capture long-term dependencies of text sequences and to compute in parallel. Adding the TCN model to the encoder therefore allows the local features of the text to be extracted more flexibly and supplements the vector information captured by the Transformer encoder.
The structure of the TCN model is shown in fig. 4. It is mainly composed of a causal convolution suited to sequence structures, together with a dilated convolution and a residual module used to memorize historical information. The causal convolution is characterized by an output length equal to the input length and by not considering future information.
Given the input text matrix E_X = {x_1, x_2, ..., x_T} and a filter F = (f_1, f_2, ..., f_O), where O is the size of the convolution kernel, the causal convolution at x_i is:

F(x_i) = Σ_{o=1}^{O} f_o · x_{i-(O-o)}   (13)
To make the output generated by the network the same length as the input, the TCN adopts a one-dimensional fully convolutional network (FCN) structure, in which each hidden layer has the same length as the input layer and zero padding keeps subsequent layers the same length as the previous ones. The combination of causal convolution and FCN has the disadvantage that a very deep network or very large filters are needed to obtain a long enough history of textual information, so the TCN introduces dilated convolution to obtain long-range historical information.
The dilated convolution exponentially increases the receptive field by inserting gaps between the convolution kernel taps, so that the output of each convolution contains a larger range of information; the output can therefore represent a larger range of input features and capture longer-range dependencies. Given the input text matrix E_X = {x_1, x_2, ..., x_T} and the filter F = (f_1, f_2, ..., f_O), the dilated convolution at x_i is:

F(x_i) = Σ_{o=1}^{O} f_o · x_{i-(O-o)·d}   (14)

where d is the dilation factor, O is the size of the convolution kernel, and i-(O-o)·d points in the direction of the history. The dilated convolution is thus equivalent to introducing a fixed step between every two adjacent filter taps. By controlling the size of d, the receptive field is widened while the amount of computation stays unchanged; when d = 1, the dilated convolution degenerates into an ordinary convolution. Finally, to improve accuracy, the TCN adds the identity mapping of cross-layer links from residual networks. The output sentence matrix of the model is B = [b_1, b_2, ..., b_T], B ∈ R^{T×d_model}.
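A minimal dilated causal convolution block corresponding to equations (13)-(14) and the residual module might be sketched as follows; the channel sizes, the left-padding trick used to enforce causality, and the dropout placement are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Dilated causal convolution: output at position i only sees x_{i-(O-o)d}, o = 1..O."""
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # left padding keeps length and causality
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                # x: (B, channels, T)
        x = F.pad(x, (self.pad, 0))                      # pad on the left only
        return self.conv(x)

class TCNResidualBlock(nn.Module):
    """Residual module: two dilated causal convolutions with ReLU and Dropout,
    plus the identity (cross-layer) connection."""
    def __init__(self, channels, kernel_size, dilation, dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(channels, kernel_size, dilation), nn.ReLU(), nn.Dropout(dropout),
            CausalConv1d(channels, kernel_size, dilation), nn.ReLU(), nn.Dropout(dropout),
        )

    def forward(self, x):
        return torch.relu(self.net(x) + x)               # residual connection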
The TCN model is used mainly to obtain more local features of the input text sequence in a flexible way through its receptive field. In Chinese named entity recognition, the label of each character in a sentence depends not only on the global features of the whole sentence but also on the local features of the surrounding characters and words. Fusing the text feature vectors output by the TCN model therefore gives the text features output by the model richer contextual semantic information. To avoid extra computation, the invention adopts the ADD feature fusion strategy to fuse the text features output by the Transformer encoder and the TCN model, obtaining the final text representation matrix H = [(z_1+b_1), (z_2+b_2), ..., (z_T+b_T)], H ∈ R^{T×d_model}.
1.3 output layer
1.3.1 CRF layer
The coding layer only yields character-word vectors that contain context information; even with relative position coding added, the dependency between the final predicted labels is not taken into account. The model therefore adopts a CRF layer, which considers the adjacency relations between labels to obtain a globally optimal label sequence. The CRF is a discriminative model based on conditional probabilities. Let the output of the model, i.e. the input sequence of the CRF, be X = (x_1, x_2, ..., x_T), and let one possible predicted tag sequence be Y = (y_1, y_2, ..., y_T). The evaluation score s is defined as:

s(X, Y) = Σ_{t=1}^{T} (A_{y_{t-1}, y_t} + P_{t, y_t})   (15)

where A_{i,j} is the transition probability from label i to label j and P_{t, y_t} is the score of the y_t-th tag for the t-th character. The probability P of the sequence Y among all possible predicted sequences is:

P(Y|X) = exp(s(X, Y)) / Σ_{Ỹ ∈ Y_X} exp(s(X, Ỹ))   (16)

where Ỹ is a possible predicted tag sequence and Y_X is the set of all possible tag sequences for the input sequence X. During CRF training, maximum likelihood estimation is used to define the loss function:

L_CRF = -log P(Y|X) = log Σ_{Ỹ ∈ Y_X} exp(s(X, Ỹ)) - s(X, Y)   (17)

At prediction time, the CRF selects, according to the trained parameters, the candidate label sequence with the maximum probability as the final result.
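For clarity, the score and loss of equations (15)-(17) can be written naively by enumerating all tag sequences (practical implementations use the forward algorithm and Viterbi decoding instead); the use of a dedicated start tag for the first transition is an assumption of this sketch.

import itertools
import torch

def crf_score(emissions, transitions, tags, start_tag):
    """s(X, Y): emissions (T, L) plays the role of P, transitions (L, L) of A."""
    score, prev = 0.0, start_tag
    for t, y in enumerate(tags):
        score = score + transitions[prev, y] + emissions[t, y]
        prev = y
    return score

def crf_neg_log_likelihood(emissions, transitions, tags, start_tag=0):
    """L_CRF = log sum_{Y'} exp(s(X, Y')) - s(X, Y), enumerated exhaustively (small T, L only)."""
    T, L = emissions.shape
    all_scores = torch.stack([
        crf_score(emissions, transitions, list(seq), start_tag)
        for seq in itertools.product(range(L), repeat=T)
    ])
    return torch.logsumexp(all_scores, dim=0) - crf_score(emissions, transitions, tags, start_tag)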
1.3.2 R-Drop
Dropout is used in both the Transformer encoder and the TCN model, and Dropout makes the output distribution of the model different at each forward pass; the invention therefore adds R-Drop to the model to constrain the outputs to remain consistent.
The R-Drop process is illustrated in FIG. 5. The Dropout method randomly discards some neurons in each layer of a neural network; in the encoder of the overall model of the invention, Dropout randomly selects neurons to discard in the multi-head self-attention layer and the feedforward network layer of the Transformer encoder and in the causal and dilated convolutions of the TCN model. Because the discarded neurons differ each time, this operation effectively makes the trained overall model a combined constraint of multiple submodels (all drawn from the same overall model), and such a model combination can improve the performance of the overall model.
Specifically, for the training data D = {(x_i, y_i)}, i = 1, ..., n, where n is the number of training samples, the input x_i of each training step is passed through the network twice, and two different output predictions are obtained from the CRF layer: P_θ(y_i|x_i) and P_θ'(y_i|x_i). Since Dropout randomly discards some neurons each time, as shown in FIG. 5 the neurons discarded in each layer for the left output prediction P_θ differ from those discarded for the right output prediction P_θ'. For the same input data x_i, two different prediction probabilities are therefore obtained from two different submodels (of the same overall model). R-Drop regularizes the output distribution of the training model by minimizing the symmetric Kullback-Leibler (KL) divergence between the two prediction probabilities:

L_KL = (1/2) [ KL(P_θ(y_i|x_i) || P_θ'(y_i|x_i)) + KL(P_θ'(y_i|x_i) || P_θ(y_i|x_i)) ]   (18)

In addition, the cross-entropy loss function of the model itself is:

L_CE = -log P_θ(y_i|x_i) - log P_θ'(y_i|x_i)   (19)

With the CRF loss function added, the final R-Drop training loss function is:

L = L_CE + L_CRF + α · L_KL   (20)

where α is the coefficient weight controlling L_KL. In this way, R-Drop regularizes the model space beyond Dropout and improves the generalization capability of the model. When Dropout is used, the training phase and the prediction phase of the model differ: during training, Dropout makes the output of each submodel approach the true distribution (model averaging), whereas at test time Dropout is turned off and the model is averaged only in parameter space, so training and testing are inconsistent. By constraining the outputs of the submodels during training, R-Drop keeps the outputs of the different submodels consistent, reduces the inconsistency between training and testing, and improves the performance of the model when Dropout is closed in the testing phase. The procedure by which R-Drop computes the loss during the training phase is sketched below.
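The algorithm listing is given as an image in the original publication. A minimal sketch of one training-step loss under the stated assumptions (a model returning per-token label logits and a separate CRF negative log-likelihood, both hypothetical interfaces here) is:

import torch.nn.functional as F

def r_drop_loss(model, crf_nll, x, y, alpha=3.0):
    """One R-Drop training step: two forward passes with independent Dropout masks,
    CRF loss for both passes plus the symmetric KL term of eq. (18)."""
    logits1 = model(x)                      # first forward pass (Dropout mask 1)
    logits2 = model(x)                      # second forward pass (Dropout mask 2)
    p1, p2 = F.log_softmax(logits1, dim=-1), F.log_softmax(logits2, dim=-1)
    # symmetric KL divergence between the two output distributions
    kl = 0.5 * (F.kl_div(p1, p2, log_target=True, reduction='batchmean')
                + F.kl_div(p2, p1, log_target=True, reduction='batchmean'))
    # negative log-likelihood (CRF loss) of both passes
    nll = crf_nll(logits1, y) + crf_nll(logits2, y)
    return nll + alpha * kl

The model must be in training mode so that the two forward passes use independent Dropout masks; otherwise the KL term collapses to zero.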
2. experiment and result analysis
To verify the effectiveness of the proposed Transformer-TCN-R-Drop Chinese named entity recognition model, experiments were carried out on two general Chinese named entity recognition datasets of different types, the Weibo dataset and the MSRA dataset. Precision, recall and the F1 value are used as the main evaluation indexes to ensure the correctness and consistency of the experimental results and thereby verify the effect of the model.
2.1 evaluation index
The invention adopts the indexes commonly used in named entity recognition tasks, namely precision (P), recall (R) and the F1 value (F), as the evaluation indexes of model performance. The calculation formula of each index is as follows:

P = TP / (TP + FP) × 100%   (21)
R = TP / (TP + FN) × 100%   (22)
F1 = 2 × P × R / (P + R) × 100%   (23)
where TP denotes the number of entities correctly predicted by the model, FP denotes the number of entities predicted by the model that are wrong, and FN denotes the number of true entities that the model failed to predict.
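For illustration, the three indexes can be computed directly from entity-level counts as follows; the (start, end, type) representation of entities is an assumption of this sketch.

def precision_recall_f1(pred_entities, gold_entities):
    """Entity-level P, R, F1 (in %); entities are sets of (start, end, type) tuples."""
    tp = len(pred_entities & gold_entities)        # correctly predicted entities
    fp = len(pred_entities - gold_entities)        # predicted but wrong
    fn = len(gold_entities - pred_entities)        # gold entities missed by the model
    p = tp / (tp + fp) * 100 if tp + fp else 0.0
    r = tp / (tp + fn) * 100 if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1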
2.2 Experimental data
The Weibo dataset and the MSRA dataset used in the experiments are public general-purpose Chinese named entity recognition datasets. The Weibo dataset is a social media dataset that contains four classes of entities: geo-political, person, place and organization names. The MSRA dataset is a dataset published by Microsoft that contains three types of entities: person names, place names and organization names. Detailed statistics of the two datasets are shown in Table 1.
Table 1 data set statistics
2.3 Experimental Environment and parameter settings
The experimental model is built on the open-source natural language processing framework FastNLP provided by Fudan University. The specific experimental environment is shown in Table 2, and the hyper-parameters adopted by the model in the experiments are shown in Table 3.
TABLE 2 Experimental Environment
TABLE 3 model hyper-parameter settings
The performance of the model is sensitive to the learning rate and the value of α. The hidden layer dimensions are set differently for the two datasets: 128 for the Weibo dataset and 160 for MSRA. Through repeated training, a set of hyper-parameters with good results was obtained: a learning rate of 0.003 is chosen to keep training stable, Dropout is set for each module (0.5 for the character-word combination part and 0.3 for the CRF output layer), and the hyper-parameter coefficient α of R-Drop is set to 3.
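For reference, the hyper-parameters reported above could be collected in a configuration such as the following; the dictionary structure and field names are illustrative only.

# Hyper-parameters reported in the text; the structure of this dict is illustrative.
HYPER_PARAMS = {
    "learning_rate": 0.003,          # chosen for training stability
    "hidden_size": {"Weibo": 128, "MSRA": 160},
    "dropout_char_word": 0.5,        # character-word combination part
    "dropout_crf_output": 0.3,       # CRF output layer
    "r_drop_alpha": 3,               # coefficient weight of the KL term
}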
2.4 results and analysis of the experiment
2.4.1 comparison results and analysis of the model itself
To verify the effectiveness of the TCN model and R-Drop, ablation experiments were performed on the Weibo and MSRA datasets and compared with the traditional model (BiLSTM) and the baseline model (FLAT), using the F1 value as the evaluation index. The baseline model uses the relative position coding adopted by the Transformer encoder, without the TCN model or R-Drop. As shown in Table 4, on both the Weibo and MSRA datasets the FLAT model is less effective than the model with the TCN structure added. Whether R-Drop is introduced alone or TCN and R-Drop are added together, the F1 values improve to different degrees on the two different types of datasets and on models of different depths.
Table 4 model ablation experimental results
After the two modules are introduced, the Transformer-TCN-R-Drop model improves significantly over the baseline: on the two datasets its F1 values are 4.43% and 2.61% higher than the traditional model and 0.86% and 0.36% higher than the baseline model, respectively. This shows that the proposed model is effective: the local information acquired by the TCN makes the context information acquired by the Transformer more effective, while the ability to capture long-distance dependencies and to compute in parallel is retained. R-Drop applies a regularization constraint, via the KL divergence, on the outputs of the submodels generated by the randomness of Dropout, reducing the degrees of freedom of the parameters and thereby strengthening the generalization capability of the model. Combining the advantages of the Transformer, the TCN and R-Drop therefore improves the overall performance of the model.
2.4.2 comparison results and analysis of different models
To verify the effectiveness of the Transformer model fusing the TCN and R-Drop, comparison experiments between the proposed model and current mainstream models were carried out on the two datasets; the comparison indexes are precision P, recall R and the F1 value. Table 5 compares the different models on the Weibo dataset and Table 6 compares them on the MSRA dataset, listing the experimental results of each model on the two types of datasets separately. The models selected for comparison are as follows:
1) the Lattice LSTM model improves the performance of chinese named entity recognition by encoding and matching words in a lexicon. But cannot capture long-distance dependency and there is a certain loss of information.
2) The CAN-NER model, the LR-CNN model and the ID-CNN model are the best methods for enhancing word information by using a convolution model so as to improve the model identification performance at present, but have information loss to a certain degree like the Lattice LSTM.
3) The CGN model is used for enhancing the recognition effect of the Chinese named entity by capturing dictionary word information in an all-round way based on a cooperative graph neural network, but needs an RNN as a bottom encoder to capture the orderliness of sentences.
4) The Transformer + relative position + CRF model, the PLT model and the FLAT model all enhance character and word information on the basis of a Transformer encoder. The first model and the FLAT model both change the position coding of the Transformer model from absolute to relative positions and modify the attention computation; the PLT model introduces a porous mechanism to enhance local modeling while maintaining the ability to capture long-term dependencies.
TABLE 5 comparison of different models on Weibo data set
Table 6 comparison of different models on MSRA dataset
The comprehensive comparisons in Tables 5 and 6 show that the Transformer model fused with TCN and R-Drop improves precision, recall and F1 value compared with the other models.
1) Comparing the CAN-NER, LR-CNN and ID-CNN-CRF models with the Lattice LSTM model, the models with convolution modules outperform the Lattice LSTM model, which shows that convolution enhances local vocabulary information and learns implicit position information, and that more contextual semantic information and relations between characters and words can be obtained.
2) Comparing the Transformer + relative position + CRF model, the PLT model and the FLAT model with the other models shows that a Transformer encoder with relative position coding extracts features more strongly than LSTM, convolutional and graph networks. The improved multi-head self-attention mechanism can capture the dependencies between core entities and selectively focus on high-value information among words, so that more highly relevant vocabulary information from the input is taken into account in the output.
3) Comparing the Transformer-TCN-R-Drop model with all the other models, after the TCN model and R-Drop are added, all three indexes of the model are higher than those of the compared models. Compared with the best competing model, on the Weibo dataset the precision improves by 0.86%, the recall by 4% and the F1 value by 0.86%; on the MSRA dataset the precision improves by 0.76%, the recall by 1.65% and the F1 value by 0.36%. This fully demonstrates that the local features dynamically captured by the TCN through its convolutional receptive field, and the new features produced by fusing them with the features captured by the Transformer encoder, extract the input vocabulary features more completely and are more conducive to label classification. Meanwhile, the R-Drop regularization strategy controls the degrees of freedom of the model parameters and complements the Dropout method, better preventing overfitting, improving generalization, and improving the recognition capability of the model on different datasets.
The invention provides a Chinese named entity recognition model that fuses a TCN with a Transformer encoder. It overcomes the loss of position and direction information and the structural shortcomings of the original Transformer encoder, and makes full use of the directionality of relative position coding, the ability of the self-attention mechanism to capture character-word features, and the ability of the TCN model to extract local information from sentences. Meanwhile, the introduced R-Drop strategy improves the robustness of the model. Experimental results show that the proposed method performs better on the two Chinese datasets than previous models, verifying the effectiveness of the changes. In subsequent work, other external information such as character structure and sememes will be considered to further optimize the model and improve its recognition capability.

Claims (6)

1. The method for identifying the Chinese named entity by fusing the time sequence convolution and the Transformer encoder is characterized by comprising the following steps of:
step one, establishing a Transformer-TCN-R-Drop model
The Transformer-TCN-R-Drop model consists of an input layer, an encoding layer and an output layer; the input layer comprises an embedding layer and position codes, the embedding layer adopts a flat lattice structure, and when character vectors are generated, word vectors corresponding to characters are generated simultaneously by combining a dictionary; the position coding adopts a mode of carrying out relative position coding on different character or word texts; the encoding layer obtains local and global characteristics of the text through a Transformer encoder and a TCN model, and fuses characteristic information captured by the two models by adopting ADD operation to finally obtain a new vector sequence; the output layer decodes the fused feature vectors by adopting a CRF model to obtain a globally optimal label sequence, and meanwhile, the generalization capability of the model is improved by adopting an R-Drop regularization strategy in the whole training process;
the Transformer encoder is formed by stacking several encoders, and the structure of each encoder layer comprises a multi-head self-attention layer, a feedforward network layer, residual connections and layer normalization;
the TCN model consists of a causal convolution suited to sequence structures, together with a dilated convolution and a residual module used to memorize historical information;
step two, utilizing the established Transformer-TCN-R-Drop model to identify the Chinese named entity
(1) the potential words corresponding to each character of a sentence are obtained from a lexicon and added to the text sequence;
(2) the final text sequence is converted into a text sequence X = {x_1, ..., x_T} with the flat lattice structure adopted by the FLAT model, where T is the length of the text sequence after the words are included;
(3) defining a text sequence as a set of a plurality of text segments span, wherein one text segment is composed of a text token, a head and a tail; the text represents a character or a word, and the head and the tail represent the position indexes of the first character and the last character of the text in the original text sequence respectively, wherein the head and the tail are the same for a single character;
(4) the embedding layer vectorizes each text segment into a matrix E_X ∈ R^{T×d_model}, where d_model is the dimension of the text vector;
(5) encoding interactions between the text segments defined in step (3) using relative position encoding;
(6) the text vector matrix from step (4) is processed by the multi-head attention mechanism in the Transformer encoder, and the position coding obtained in step (5) is added during the calculation;
(7) the output text vector matrix of the multi-head attention calculation is added to its input to form a residual connection; layer normalization is then applied;
(8) the normalized result of step (7) is fed into the feedforward network layer, where a nonlinear transformation is applied using ReLU activation;
(9) similarly, the output after the ReLU activation in step (8) is residually connected with the normalized output of step (7), layer normalization is applied again, and the text feature A of the Transformer encoder is output;
(10) obtaining local characteristic information between text vector matrixes by using a causal convolution and an expansion convolution of the TCN;
(11) the TCN regularizes the text vector matrix with residual modules; each residual module comprises two layers of convolution and nonlinear mapping, and the resulting text feature B is output;
(12) the two features A and B are fused using the ADD operation;
(13) Dropout is added to each layer of the Transformer encoder and the TCN network to regularize the network, and an R-Drop regularization strategy is adopted to avoid inconsistent output distributions of the training model caused by Dropout;
(14) the label sequence is output using a Softmax function and the CRF layer to obtain the final recognition result.
2. The method for Chinese named entity recognition fusing temporal convolution and a Transformer encoder as claimed in claim 1, wherein the specific process from step (6) to step (9) is as follows:
the vector matrix E_X vectorized in step (4) is multiplied by three different weight matrices W_q, W_k, W_v to obtain three vectors of the same dimension, namely the Query vector Q, the Key vector K and the Value vector V:

[Q, K, V] = E_X [W_q, W_k, W_v]

when the attention score is computed, only the relative positional relation between the Query vector and the Key vector is considered; this relative position information is added to the self-attention calculation of every layer of the Transformer encoder, the attention scores are then normalized with a Softmax function and multiplied by the Value vector, and finally the weighted sum over all text vectors is output; each character vector thus contains not only information about the other characters but also word information, position information and distance information:

A_ij = (Q_i + u)^T K_j + (Q_i + v)^T R_ij W_R
Att(A, V) = softmax(A) V

where u, v and W_R are learnable parameters, A_ij is the attention score of a single text segment, and R_ij is the relative position coding matrix;
the Transformer encoder performs the attention calculation on the text sequence with several attention heads that do not share weight matrices and then concatenates the results of the heads:

Multihead(A) = [head_1, head_2, ..., head_n] W
head_i = Att(A_i, V_i)

where i indexes the attention heads, i ∈ [1, n], and A_i and V_i are computed with the i-th head's own projection matrices; the output is then processed by the feedforward network layer FFN, a position-wise multi-layer perceptron with a nonlinear transformation that increases the nonlinear expression capability of the model; meanwhile, to alleviate the degradation problem of deep network training, residual connections and layer normalization are added after the multi-head self-attention layer and the feedforward network layer, finally yielding a new sentence matrix Z = [z_1, z_2, ..., z_T], Z ∈ R^{T×d_model}.
3. The method for Chinese named entity recognition by fused time series convolution and Transformer encoder as claimed in claim 1, wherein the specific process from step (10) to step (11) is as follows:
the causal convolution is characterized by an output length equal to the input length and by not considering future information; given the input text matrix E_X = {x_1, x_2, ..., x_T} and a filter F = (f_1, f_2, ..., f_O), where O is the size of the convolution kernel, the causal convolution at x_i is:

F(x_i) = Σ_{o=1}^{O} f_o · x_{i-(O-o)}

to make the output generated by the network the same length as the input, the TCN adopts a one-dimensional fully convolutional network (FCN) structure, in which each hidden layer has the same length as the input layer and zero padding keeps subsequent layers the same length as the previous ones;
the dilated convolution exponentially increases the receptive field by inserting gaps between the convolution kernel taps, so that the output of each convolution contains a larger range of information; the output can therefore represent a larger range of input features and capture longer-range dependencies; given the input text matrix E_X = {x_1, x_2, ..., x_T} and the filter F = (f_1, f_2, ..., f_O), the dilated convolution at x_i is:

F(x_i) = Σ_{o=1}^{O} f_o · x_{i-(O-o)·d}

where d is the dilation factor, O is the size of the convolution kernel, and i-(O-o)·d points in the direction of the history; the dilated convolution is thus equivalent to introducing a fixed step between every two adjacent filter taps; by controlling the size of d, the receptive field is widened while the amount of computation stays unchanged, and when d = 1 the dilated convolution degenerates into an ordinary convolution; finally, to improve accuracy, the TCN adds the identity mapping of cross-layer links from residual networks; the output sentence matrix of the model is B = [b_1, b_2, ..., b_T], B ∈ R^{T×d_model}.
4. The method for Chinese named entity recognition fusing temporal convolution and a Transformer encoder as claimed in claim 1, wherein in step (12) an ADD feature fusion strategy is used to fuse the text features output by the Transformer encoder and the TCN model, obtaining the final text representation matrix H = [(z_1+b_1), (z_2+b_2), ..., (z_T+b_T)], H ∈ R^{T×d_model}.
5. The method for Chinese named entity recognition by merging time series convolution and Transformer encoder as claimed in claim 1, wherein the specific process of step (13) is:
for the training data D = {(x_i, y_i)}, i = 1, ..., n, where n is the number of training samples, the input x_i of each training step is passed through the network twice, and two different output predictions are obtained from the CRF layer: P_θ(y_i|x_i) and P_θ'(y_i|x_i); since Dropout randomly discards some neurons each time and the discarded neurons differ between the two passes, the same input data x_i yields two different prediction probabilities through two different submodels of the same overall model; R-Drop regularizes the output distribution of the training model by minimizing the symmetric KL divergence between the two prediction probabilities:

L_KL = (1/2) [ KL(P_θ(y_i|x_i) || P_θ'(y_i|x_i)) + KL(P_θ'(y_i|x_i) || P_θ(y_i|x_i)) ]

in addition, the cross-entropy loss function of the model itself is:

L_CE = -log P_θ(y_i|x_i) - log P_θ'(y_i|x_i)

with the CRF loss function added, the final R-Drop training loss function is:

L = L_CE + L_CRF + α · L_KL

where α is the coefficient weight controlling L_KL.
6. The method for Chinese named entity recognition by fused time series convolution and Transformer encoder as claimed in claim 1, wherein the specific process of step (14) is:
the CRF is a discriminative model based on conditional probabilities; let the output of the model, i.e. the input sequence of the CRF, be X = (x_1, x_2, ..., x_T), and let one possible predicted tag sequence be Y = (y_1, y_2, ..., y_T); the evaluation score s is defined as:

s(X, Y) = Σ_{t=1}^{T} (A_{y_{t-1}, y_t} + P_{t, y_t})

where A_{i,j} is the transition probability from label i to label j and P_{t, y_t} is the score of the y_t-th tag for the t-th character; the probability P of the sequence Y among all possible predicted sequences is:

P(Y|X) = exp(s(X, Y)) / Σ_{Ỹ ∈ Y_X} exp(s(X, Ỹ))

where Ỹ is a possible predicted tag sequence and Y_X is the set of all possible tag sequences for the input sequence X; during CRF training, maximum likelihood estimation is used to define the loss function:

L_CRF = -log P(Y|X) = log Σ_{Ỹ ∈ Y_X} exp(s(X, Ỹ)) - s(X, Y)

at prediction time, the CRF selects, according to the trained parameters, the candidate label sequence with the maximum probability as the final result.
CN202111399845.9A 2021-11-24 2021-11-24 Chinese named entity recognition method integrating time sequence convolution and transform encoder Active CN114169330B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111399845.9A CN114169330B (en) 2021-11-24 2021-11-24 Chinese named entity recognition method integrating time sequence convolution and transform encoder

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111399845.9A CN114169330B (en) 2021-11-24 2021-11-24 Chinese named entity recognition method integrating time sequence convolution and transform encoder

Publications (2)

Publication Number Publication Date
CN114169330A true CN114169330A (en) 2022-03-11
CN114169330B CN114169330B (en) 2023-07-14

Family

ID=80480130

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111399845.9A Active CN114169330B (en) Chinese named entity recognition method integrating time sequence convolution and Transformer encoder

Country Status (1)

Country Link
CN (1) CN114169330B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2020103654A4 (en) * 2019-10-28 2021-01-14 Nanjing Normal University Method for intelligent construction of place name annotated corpus based on interactive and iterative learning
CN111477221A (en) * 2020-05-28 2020-07-31 中国科学技术大学 Speech recognition system using bidirectional time sequence convolution and self-attention mechanism network
CN111783462A (en) * 2020-06-30 2020-10-16 大连民族大学 Chinese named entity recognition model and method based on dual neural network fusion
CN113269277A (en) * 2020-07-27 2021-08-17 西北工业大学 Continuous dimension emotion recognition method based on Transformer encoder and multi-head multi-modal attention
CN112989834A (en) * 2021-04-15 2021-06-18 杭州一知智能科技有限公司 Named entity identification method and system based on flat grid enhanced linear converter

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JIANZHUO et al.: "Research on Named Entity Recognition in Chinese EMR Based on Semi-Supervised Learning with Dual Selected Strategy", ACAI 2020, pages 1-10 *
LIN SUN et al.: "Joint Learning of Token Context and Span Feature for Span-Based Nested NER", IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pages 2720-2730, XP011814457, DOI: 10.1109/TASLP.2020.3024944 *
JIANG Wei: "Research on Chinese Named Entity Recognition Based on Temporal Convolutional Networks", China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology, pages 138-2568 *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114943229A (en) * 2022-04-15 2022-08-26 西北工业大学 Software defect named entity identification method based on multi-level feature fusion
CN114943229B (en) * 2022-04-15 2024-03-12 西北工业大学 Multi-level feature fusion-based software defect named entity identification method
CN114580424A (en) * 2022-04-24 2022-06-03 之江实验室 Labeling method and device for named entity identification of legal document
CN114611336A (en) * 2022-05-11 2022-06-10 中国农业大学 Circulating water aquaculture dissolved oxygen prediction control method, device, equipment and medium
CN115019143A (en) * 2022-06-16 2022-09-06 湖南大学 Text detection method based on CNN and Transformer mixed model
CN115081439A (en) * 2022-07-01 2022-09-20 淮阴工学院 Chemical medicine classification method and system based on multi-feature adaptive enhancement
CN115081439B (en) * 2022-07-01 2024-02-27 淮阴工学院 Multi-feature self-adaptive enhancement-based chemical classification method and system
CN115545269A (en) * 2022-08-09 2022-12-30 南京信息工程大学 Power grid parameter identification method based on convolution self-attention Transformer model
CN116186574B (en) * 2022-09-09 2023-12-12 武汉中数医疗科技有限公司 Thyroid sampling data identification method based on artificial intelligence
CN116186574A (en) * 2022-09-09 2023-05-30 武汉中数医疗科技有限公司 Thyroid sampling data identification method based on artificial intelligence
CN115344504A (en) * 2022-10-19 2022-11-15 广州软件应用技术研究院 Software test case automatic generation method and tool based on requirement specification
CN115344504B (en) * 2022-10-19 2023-03-24 广州软件应用技术研究院 Software test case automatic generation method and tool based on requirement specification
CN115879473B (en) * 2022-12-26 2023-12-01 淮阴工学院 Chinese medical named entity recognition method based on improved graph attention network
CN115879473A (en) * 2022-12-26 2023-03-31 淮阴工学院 Chinese medical named entity recognition method based on improved graph attention network
CN116341557A (en) * 2023-05-29 2023-06-27 华北理工大学 Diabetes medical text named entity recognition method
CN117077672A (en) * 2023-07-05 2023-11-17 哈尔滨理工大学 Chinese naming entity recognition method based on vocabulary enhancement and TCN-BILSTM model
CN117077672B (en) * 2023-07-05 2024-04-26 哈尔滨理工大学 Chinese naming entity recognition method based on vocabulary enhancement and TCN-BILSTM model
CN117236323A (en) * 2023-10-09 2023-12-15 青岛中企英才集团商业管理有限公司 Information processing method and system based on big data
CN117236323B (en) * 2023-10-09 2024-03-29 京闽数科(北京)有限公司 Information processing method and system based on big data
CN117807603A (en) * 2024-02-29 2024-04-02 浙江鹏信信息科技股份有限公司 Software supply chain auditing method, system and computer readable storage medium
CN117807603B (en) * 2024-02-29 2024-04-30 浙江鹏信信息科技股份有限公司 Software supply chain auditing method, system and computer readable storage medium

Also Published As

Publication number Publication date
CN114169330B (en) 2023-07-14

Similar Documents

Publication Publication Date Title
CN114169330B (en) Chinese named entity recognition method integrating time sequence convolution and Transformer encoder
CN109284506B (en) User comment emotion analysis system and method based on attention convolution neural network
CN111897908B (en) Event extraction method and system integrating dependency information and pre-training language model
CN109299273B (en) Multi-source multi-label text classification method and system based on improved seq2seq model
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN112613308B (en) User intention recognition method, device, terminal equipment and storage medium
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN112749274B (en) Chinese text classification method based on attention mechanism and interference word deletion
CN116432655B (en) Method and device for identifying named entities with few samples based on language knowledge learning
CN112287672A (en) Text intention recognition method and device, electronic equipment and storage medium
CN115831102A (en) Speech recognition method and device based on pre-training feature representation and electronic equipment
CN113806494A (en) Named entity recognition method based on pre-training language model
CN109446326A (en) Biomedical event based on replicanism combines abstracting method
CN114429132A (en) Named entity identification method and device based on mixed lattice self-attention network
CN113255366A (en) Aspect-level text emotion analysis method based on heterogeneous graph neural network
CN114528835A (en) Semi-supervised specialized term extraction method, medium and equipment based on interval discrimination
CN110298046B (en) Translation model training method, text translation method and related device
CN113887836B (en) Descriptive event prediction method integrating event environment information
CN111259673A (en) Feedback sequence multi-task learning-based law decision prediction method and system
CN116629361A (en) Knowledge reasoning method based on ontology learning and attention mechanism
CN115796182A (en) Multi-modal named entity recognition method based on entity-level cross-modal interaction
CN114298052B (en) Entity joint annotation relation extraction method and system based on probability graph
CN115906857A (en) Chinese medicine text named entity recognition method based on vocabulary enhancement
CN114580423A (en) Bert and Scat-based shale gas field named entity identification method
CN114756679A (en) Chinese medical text entity relation combined extraction method based on conversation attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
  Address after: Room 1603-12, No. 8, Financial Second Street, Wuxi Economic Development Zone, Jiangsu Province, 214000
  Applicant after: Uni-Entropy Intelligent Technology (Wuxi) Co.,Ltd.
  Address before: 214072 room 1603-12, No. 8, financial Second Street, economic development zone, Wuxi City, Jiangsu Province
  Applicant before: Yunentropy Education Technology (Wuxi) Co.,Ltd.
GR01 Patent grant