CN112380863A - Sequence labeling method based on multi-head self-attention mechanism - Google Patents
- Publication number
- CN112380863A CN112380863A CN202011187198.0A CN202011187198A CN112380863A CN 112380863 A CN112380863 A CN 112380863A CN 202011187198 A CN202011187198 A CN 202011187198A CN 112380863 A CN112380863 A CN 112380863A
- Authority
- CN
- China
- Prior art keywords
- word
- sequence
- semantic
- semantic representation
- attention mechanism
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a sequence labeling method based on a multi-head self-attention mechanism, which comprises the following steps: step 1, local context semantic coding, namely using a BLSTM (bidirectional long short-term memory network) to sequentially learn the local context semantic representation of words in a text; step 2, global semantic coding, namely encoding the global semantic representation of the words with a multi-head self-attention mechanism based on the local context semantic representation encoded in step 1; step 3, semantic feature fusion, namely fusing the local context semantic representation encoded in step 1 with the global semantic representation encoded in step 2 and taking the fusion result as the input semantic feature of step 4; step 4, sequence labeling, namely predicting labels with a CRF (conditional random field) in order to fully consider the dependency relationships among labels in the sequence labeling task; step 5, model training; and step 6, model inference. The invention further introduces a multi-head self-attention mechanism on the basis of the recurrent neural network to learn the global semantic representation of words, and improves the sequence labeling effect.
Description
Technical Field
The invention relates to the technical field of computer applications, and in particular to a sequence labeling method based on a multi-head self-attention mechanism.
Background
Sequence labeling is an important research topic in natural language processing. It aims to predict a corresponding tag sequence for a given text sequence, and mainly comprises tasks such as Named Entity Recognition (NER), Text Chunking, Part-of-Speech tagging (POS), and Opinion Extraction.
Early sequence labeling methods were mostly rule-based, requiring rule templates and a large amount of expert knowledge, consuming considerable manpower and material resources, and being difficult to extend and transplant to other fields. For example, Wang Ning et al. manually established a knowledge base for financial company name recognition in a rule-based manner. Toral and Muñoz automatically built and maintained gazetteers (lists of person names, organizations, places and other entities) for entity recognition based on the analysis of online Wikipedia. Zizhenning et al. constructed and customized a named entity recognition tagger which, although field-adaptive and achieving good experimental results, was still manually built and time-consuming.
Due to the shortcomings of rule-based methods, machine learning models based on statistical learning have been increasingly applied to sequence labeling, such as Support Vector Machines (SVM), Hidden Markov Models (HMM), Conditional Random Fields (CRF), and Maximum Entropy models (ME). For example, Mayfield et al. used an SVM to capture hundreds of features from the training data. Zhou and Su proposed an HMM-based named entity recognition system that can apply and fuse simple word features (such as capitalization and digits). McCallum and Li applied CRFs to named entity recognition and performed well on multiple datasets. Liu et al. applied the ME model to named entity recognition while incorporating a method that fuses local and global features within sentences. Although methods based on statistical learning models achieve better performance, they still depend heavily on hand-crafted features and can only capture local features.
In recent years, with the rapid development of deep learning, its strong learning and automatic feature extraction capabilities have brought success in natural language processing, and deep learning is therefore widely used in many sequence labeling tasks. For example, Zhang Miao et al. applied the BLSTM-CRF framework to sequence labeling and achieved highly competitive performance, because the BLSTM effectively utilizes context features and the CRF models sentence-level tag information. Chiu proposed a novel BLSTM-CNN model that obtains character features through a CNN, concatenates them with word embeddings and feeds them into a BLSTM; although effective, it relies on dictionary or vocabulary features. Recently, the attention mechanism has been increasingly applied to sequence labeling tasks. Compared with an LSTM or CNN, the attention mechanism is not affected by distance when modeling semantic dependencies. For example, Rei et al. combined an attention mechanism with the BLSTM-CRF framework to learn weight coefficients and fed the weighted sum of the two features into the CRF for label prediction. Luo et al. demonstrated that introducing the attention mechanism into BLSTM-CRF can improve chemical and drug entity recognition, improve labeling consistency at the document level, and enrich context information at the sentence level. Tan et al. proposed a deep attention network for sequence labeling, using an N-layer deep model in which each layer contains a non-linear layer and a self-attention layer, with the output of the highest layer fed into a softmax layer. Although existing deep-learning-based methods achieve better performance, they still suffer from drawbacks such as local dependency and inaccurate position information.
In summary, most existing sequence labeling methods are built on the LSTM-CRF framework, but using an LSTM as the encoder to learn the context semantic representation of words in a text usually suffers from two problems. First, sequence labeling models based on recurrent neural networks usually exhibit local dependency, so semantic information at long distances is lost, and the longer the distance between two words, the more pronounced this problem becomes. Second, sequence labeling models based on recurrent neural networks are limited to serialized feature learning and therefore cannot flexibly model the semantic relationship between any two words in the text.
Disclosure of Invention
The invention aims to provide a sequence labeling method based on a multi-head self-attention mechanism, addressing the problems of local dependency and serialized coding in prior-art sequence labeling methods.
the technical scheme adopted for realizing the purpose of the invention is as follows:
a sequence labeling method based on a multi-head self-attention mechanism comprises the following steps executed in sequence:
step 1, local context semantic coding, namely learning local context semantic representation of words in a text in a BLSTM serialization manner:
step 1.1, performing word segmentation on an input text to obtain a corresponding word sequence;
step 1.2, for each word in the word sequence, coding character-level vector representation corresponding to each word by using a BLSTM structure;
step 1.3, for each word in the word sequence, splicing the character-level vector representation and the word embedding vector representation coded in the step 1.2 to serve as word initial semantic representation;
step 1.4, based on the word initial semantic representation obtained in step 1.3, using BLSTM to encode the local context semantic representation of each word;
step 2, global semantic coding, namely coding the global semantic representation of the words by utilizing a multi-head self-attention mechanism based on the local context semantic representation of the words coded in the step 1:
step 2.1, mapping the local context semantic representation of the words coded in step 1 to a plurality of different feature subspaces by adopting a full connection layer;
step 2.2, under different feature subspaces obtained in step 2.1, utilizing a self-attention mechanism to encode semantic representation of words;
step 2.3, the semantic representations of the words in each feature subspace calculated in the step 2.2 are spliced, and the splicing result is input into a full connection layer to obtain the global semantic representation corresponding to each word;
and 3, semantic feature fusion, namely constructing the following three feature fusion modes, fusing the local context semantic representation coded in the step 1 and the global semantic representation coded in the step 2, and taking a fusion result as an input semantic feature in the step 4:
step 3.1, constructing a one-dimensional parameter fusion method to realize the linear combination of local context semantics and global semantics;
step 3.2, building a multi-dimensional parameter fusion method by using a gating mechanism adopted in the LSTM for reference;
3.3, constructing a free weight semantic fusion method;
and 4, carrying out sequence labeling, namely predicting the labels by using a CRF (conditional random field) in order to fully consider the dependency relationships among the labels in the sequence labeling task:
step 4.1, performing full-connection transformation on the fused semantic feature sequence obtained in the step 3 to obtain a state feature matrix, and representing the association between the semantics of each word and the label;
step 4.2, initializing a transfer characteristic matrix randomly to express the transfer relation between the labels;
step 4.3, calculating the corresponding score and probability of any possible label sequence based on the state characteristic matrix obtained in the step 4.1 and the transfer characteristic matrix obtained in the step 4.2;
step 5, model training: in the model training process, optimizing the parameters in the steps 1 to 4 by adopting the probability corresponding to the maximized standard label sequence;
step 6, model reasoning: in the practical application process, the optimal label sequence is searched by adopting a Viterbi algorithm, and model reasoning is carried out.
In the above technical solution, in the step 1.1, a Stanford NLP toolkit is used to perform word segmentation on an input text.
In the above technical solution, in step 1.3, the initial semantic representation of the word is e_i = [c_i; w_i], where c_i is the character-level vector representation and w_i is the word embedding vector representation.
In the above technical solution, in step 1.4, based on the word initial semantic representation sequence E = {e_1, e_2, …, e_N} obtained in step 1.3, BLSTM is used to encode the local context semantic representation h_i of each word x_i in the text.
In the above technical solution, in step 2.1, the word local context semantic representation sequence H = {h_1, h_2, …, h_N} encoded in step 1 is mapped to M different feature subspaces, where the mapping for the i-th feature subspace is:
Q_i = H W_i^Q, K_i = H W_i^K, V_i = H W_i^V
where W_i^Q, W_i^K and W_i^V are model parameters; Q represents the query in the attention mechanism, K represents the key, and V represents the value corresponding to the key.
In the above technical solution, in step 2.2, in the different feature subspaces obtained in step 2.1, a dot-product self-attention mechanism is used to encode the semantic representation of the words:
head_i = Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_k) V_i
where d_k represents the feature dimension of the subspace and T represents the matrix transpose operation.
In the above technical solution, in step 2.3, the semantic representations head_i in each feature subspace calculated in step 2.2 are concatenated, and the concatenation result is fed into a fully connected layer to obtain the global semantic representation sequence Z corresponding to the words:
Z = [head_1; head_2; …; head_M] W_z
where W_z is a model parameter.
In the above technical solution, in step 3, the semantic representation after one-dimensional parameter fusion is: u_i = (1 - β_i)·h_i + β_i·z_i,
where β_i = sigmoid(W_β[h_i; z_i]), h_i is the local context semantic representation, z_i is the global semantic representation, and W_β is a model parameter;
the semantic representation after multi-dimensional parameter fusion is: u_i = (I - α_i) ⊙ h_i + α_i ⊙ z_i,
where α_i = sigmoid(W_α[h_i; z_i]) and W_α is a model parameter;
the semantic representation after free-weight semantic fusion is: u_i = γ_i ⊙ h_i + δ_i ⊙ z_i,
where γ and δ are two trainable parameters.
In the above technical solution, in step 4.1, the fused semantic feature sequence U = {u_1, u_2, …, u_N} obtained in step 3 is passed through a fully connected transformation to obtain the state feature matrix P, which represents the association between the semantics of each word and the labels:
P = U W_p + b_p
where W_p and b_p are model parameters.
In the above technical solution, in step 4.3, based on the state feature matrix P obtained in step 4.1 and the transfer feature matrix A obtained in step 4.2, the score corresponding to any possible label sequence Y' = {y'_1, y'_2, …, y'_N} is calculated as:
s(X, Y') = Σ_{i=1..N} (A_{y'_{i-1}, y'_i} + P_{i, y'_i})
Based on the score, the probability corresponding to the label sequence is calculated:
P(Y' | X) = exp(s(X, Y')) / Σ_{Y''} exp(s(X, Y''))
In the model training process, the probability P(Y | X) corresponding to the standard label sequence is maximized, and the parameters in steps 1 to 4 are optimized by minimizing the following negative log-likelihood function:
L = -log P(Y | X)
In step 6, the Viterbi algorithm is used to search for the optimal label sequence:
Y* = argmax_{Y'} s(X, Y')
compared with the prior art, the invention has the beneficial effects that:
1. The invention further introduces a multi-head self-attention mechanism on the basis of the recurrent neural network to learn the global semantic representation of words, improves the sequence labeling effect, and effectively alleviates the problems of local dependency and serialized coding caused by encoding with a recurrent neural network.
2. The local context semantics encoded by the recurrent neural network comprehensively consider the short-distance semantics of words and the word-order relationships between words, while the global semantics encoded by the multi-head self-attention mechanism are not limited by distance when modeling semantics, which compensates for the recurrent neural network's weakness in long-distance semantic modeling but lacks word-order modeling. The local semantics and the global semantics are therefore complementary to a certain extent. The invention considers both kinds of semantics, constructs three semantic feature fusion methods to fuse the local semantic features learned by the BLSTM with the global semantic features learned by the multi-head self-attention mechanism so that their advantages complement each other, takes the fusion result as the input semantic features, and improves the sequence labeling effect.
Drawings
FIG. 1 is a schematic diagram of the overall structure of the present invention.
FIG. 2 is a schematic diagram of a sequence labeling method based on a multi-head self-attention mechanism.
Detailed Description
The present invention will be described in further detail with reference to specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example 1
The invention first learns the context semantic features of words in a text with a bidirectional long short-term memory network (BLSTM). Then, based on the hidden representations learned by the BLSTM, a multi-head self-attention mechanism is adopted to model the semantic relationship between any two words in the text, thereby obtaining the global semantics that each word should attend to. To fully exploit the complementarity of local context semantics and global semantics, the invention designs three feature fusion modes to fuse the two kinds of semantics, and uses a conditional random field (CRF) model to predict the label sequence based on the fused features.
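As a purely illustrative reference, the following is a minimal PyTorch sketch of this overall pipeline (character and word embeddings, BLSTM encoder, multi-head self-attention, one-dimensional fusion, CRF state and transfer features). The module name, dimensions, head count and tag-set size are assumptions made for the sketch, not the exact implementation of the invention.

```python
# Minimal sketch of the overall pipeline (assumed dimensions, not the patented code).
import torch
import torch.nn as nn

class SequenceLabeler(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, char_dim=50, hidden_dim=200,
                 num_heads=8, num_tags=9):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        # step 1: word-level BLSTM over [character vector; word embedding]
        self.word_blstm = nn.LSTM(emb_dim + char_dim, hidden_dim // 2,
                                  batch_first=True, bidirectional=True)
        # step 2: multi-head self-attention for global semantics
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        # step 3.1: one-dimensional parameter fusion weight
        self.fuse = nn.Linear(2 * hidden_dim, 1)
        # step 4.1: state features (emissions) for the CRF
        self.emission = nn.Linear(hidden_dim, num_tags)
        # step 4.2: randomly initialized transfer feature matrix A
        self.transitions = nn.Parameter(torch.randn(num_tags, num_tags))

    def forward(self, word_ids, char_repr):
        e = torch.cat([self.word_emb(word_ids), char_repr], dim=-1)   # e_i = [c_i; w_i]
        h, _ = self.word_blstm(e)                                     # local context h_i
        z, _ = self.self_attn(h, h, h)                                # global semantics z_i
        beta = torch.sigmoid(self.fuse(torch.cat([h, z], dim=-1)))    # fusion weight beta_i
        u = (1 - beta) * h + beta * z                                 # fused features u_i
        return self.emission(u)                                       # state feature matrix P
```

A CRF layer, sketched separately in Example 3 below, would then score candidate tag sequences from the returned state features and the transfer matrix.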
Example 2
The invention mainly adopts deep learning technology and theoretical methods of natural language processing to realize the sequence labeling task. To ensure normal operation of the system, the computer platform used in the specific implementation should have at least 8 GB of memory, a CPU with at least 4 cores and a main frequency of at least 2.6 GHz, a GPU environment, and a Linux operating system, with the necessary software environment installed, such as Python 3.6 or above and PyTorch 0.4 or above.
As shown in fig. 1, the sequence labeling method based on the multi-head self-attention mechanism provided by the present invention mainly comprises the following steps executed in sequence:
step 1, local context semantic coding: the local context semantic representation of words in text is learned sequentially using a bidirectional long-short term memory network (BLSTM).
Step 1.1) the Stanford NLP toolkit is adopted to perform word segmentation on the input text to obtain a corresponding word sequence.
Step 1.2) for each word in the word sequence, encoding character-level vector representation corresponding to each word by using a Bidirectional LSTM (BLSTM) structure.
Step 1.3) for each word in the text, splicing the character-level vector representation encoded in step 1.2) with the word-embedded vector representation as the initial semantic representation of the word.
Step 1.4) Using BLSTM to encode the local context semantic representation of each word in the text: the input is the initial semantic representation of the words obtained in step 1.3), and the output is the local context semantic representation of each word.
Step 2, global semantic coding: encoding a global semantic representation of the word using a multi-headed self-attention mechanism based on the local context semantic representation of the word encoded in step 1).
Step 2.1) mapping the local context semantic representation of the word encoded in step 1 to a plurality of different feature subspaces using a full connectivity layer.
Step 2.2) utilizing a self-attention mechanism to encode semantic representation of the words under different feature subspaces obtained in the step 2.1).
Step 2.3) Concatenating the semantic representations in each feature subspace calculated in step 2.2), and feeding the concatenation result into a fully connected layer to obtain the global semantic representation corresponding to each word.
Step 3, semantic feature fusion: constructing the following three feature fusion modes, fusing the local semantic representation encoded in step 1) with the global semantic representation encoded in step 2), and taking the fusion result as the input semantic feature of step 4.
Step 3.1) Constructing a one-dimensional parameter fusion method to realize the linear combination of local and global semantics.
Step 3.2) Constructing a multi-dimensional parameter fusion method drawing on the gating mechanism used in the LSTM.
Step 3.3) Constructing a free-weight semantic fusion method.
Step 4, sequence labeling: in order to fully consider the dependency relationships among the labels in the sequence labeling task, a CRF is used to predict the labels.
Step 4.1) Performing a fully connected transformation on the fused semantic feature sequence obtained in step 3 to obtain a state feature matrix representing the association between the semantics of each word and the labels.
Step 4.2) Randomly initializing a transfer feature matrix to represent the transfer relationships between labels.
Step 4.3) Calculating the score and probability corresponding to any possible label sequence based on the state feature matrix obtained in step 4.1) and the transfer feature matrix obtained in step 4.2).
Step 5, model training: in the model training process, the parameters in the steps 1 to 4 are optimized by adopting the probability corresponding to the maximized standard label sequence.
Step 6, model reasoning: in the practical application process, the optimal label sequence is searched by adopting a Viterbi algorithm, and model reasoning is carried out.
Example 3
The sequence labeling method based on the multi-head self-attention mechanism mainly comprises the following steps executed in sequence:
step 1, local context semantic coding: the local context semantic representation of words in text is learned sequentially using a bidirectional long-short term memory network (BLSTM).
Step 1.1) The Stanford NLP toolkit is used to perform word segmentation on the input text to obtain the corresponding word sequence X = {x_1, x_2, …, x_N}.
For example, given the text "I participated in a marathon race in Tianjin yesterday", the word sequence {"I", "yesterday", "in", "Tianjin", "participating in", "having", "a", "marathon", "race"} is obtained after word segmentation (the tokens follow the segmentation of the original Chinese sentence).
Step 1.2) Considering that words in text usually contain rich morphological features, such as prefix and suffix information, this step uses a bidirectional LSTM (BLSTM) structure to encode, for each word x_i in the word sequence, the corresponding character-level vector representation c_i, where c_{i,j} denotes the j-th character of the i-th word in the text.
For example, for the 4th word "Tianjin" in the word sequence, its 1st character is "Tian" and its 2nd character is "jin". Through BLSTM encoding, the character-level vector representation c_4 of "Tianjin" is obtained.
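A minimal sketch of this character-level encoding step is given below; the embedding and hidden sizes are assumptions chosen for illustration only.

```python
# Sketch of step 1.2: a character-level BLSTM producing c_i for each word (assumed sizes).
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    def __init__(self, num_chars, char_emb_dim=30, char_hidden=25):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_emb_dim)
        self.char_blstm = nn.LSTM(char_emb_dim, char_hidden,
                                  batch_first=True, bidirectional=True)

    def forward(self, char_ids):            # char_ids: (num_words, max_word_len)
        emb = self.char_emb(char_ids)       # character embeddings
        _, (h_n, _) = self.char_blstm(emb)  # h_n: (2, num_words, char_hidden)
        # concatenate the last forward and backward hidden states as c_i
        return torch.cat([h_n[0], h_n[1]], dim=-1)   # (num_words, 2 * char_hidden)
```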
Step 1.3) For each word in the text, its index in the predefined vocabulary is first found by table lookup, and the corresponding vector is retrieved from the pre-trained word vector set using this index as the word embedding vector representation w_i. The character-level vector representation c_i encoded in step 1.2) is then concatenated with the word embedding vector representation w_i as the initial semantic representation of the word: e_i = [c_i; w_i].
For example, for the 4th word "Tianjin" in the word sequence, concatenating its character-level features and its word embedding vector gives the initial semantic representation e_4 = [0.04, -0.77, …, 0.31; 0.11, 0.89, …, -0.25].
Step 1.4) Based on the initial semantic representation sequence E = {e_1, e_2, …, e_N} obtained in step 1.3), BLSTM is used to encode the local context semantic representation h_i of each word x_i in the text.
For example, after the text is encoded by the BLSTM, the local context semantic representation corresponding to the 4th word "Tianjin" in the word sequence is h_4 = [0.02, 0.11, …, 0.76].
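Steps 1.3) and 1.4) can be sketched together as follows; the sentence length, embedding sizes and hidden size are assumed values chosen to match the nine-word example.

```python
# Sketch of steps 1.3-1.4: e_i = [c_i; w_i], then a word-level BLSTM yields h_i (assumed sizes).
import torch
import torch.nn as nn

word_emb = torch.randn(1, 9, 100)   # w_i for the 9 words of the example sentence
char_repr = torch.randn(1, 9, 50)   # c_i from the character-level BLSTM sketch above
e = torch.cat([word_emb, char_repr], dim=-1)          # initial semantics e_i = [c_i; w_i]

blstm = nn.LSTM(input_size=150, hidden_size=100,
                batch_first=True, bidirectional=True)
h, _ = blstm(e)                     # h: (1, 9, 200), local context representations h_i
h4 = h[0, 3]                        # local context semantics of the 4th word ("Tianjin")
```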
Step 2) global semantic coding: encoding a global semantic representation of the word using a multi-headed self-attention mechanism based on the local context semantic representation of the word encoded in step 1).
Step 2.1) To learn more diverse global semantic representations with the self-attention mechanism, this step uses a fully connected layer to map the local context semantic representation sequence H = {h_1, h_2, …, h_N} encoded in step 1) into M different feature subspaces. The mapping for the i-th feature subspace is:
Q_i = H W_i^Q, K_i = H W_i^K, V_i = H W_i^V
where W_i^Q, W_i^K and W_i^V are model parameters; Q represents the query in the attention mechanism, K represents the key, and V represents the value corresponding to the key.
For example, through the fully connected transformation of the context semantic representation sequence encoded in step 1), the query Q_i, the key K_i and the value V_i required by the attention mechanism in the i-th feature subspace are obtained.
Step 2.2) encoding the semantic representation of the word by using a self-attention mechanism based on dot product under different feature subspaces obtained in step 2.1):
head_i = Attention(Q_i, K_i, V_i) (8)
Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_k) V_i (9)
where d_k represents the feature dimension of the subspace and T represents the matrix transpose operation.
For example, in the i-th feature subspace, the semantic representation head_i encoded by the attention mechanism is obtained.
Step 2.3) The semantic representations head_i in each feature subspace calculated in step 2.2) are concatenated, and the concatenation result is fed into a fully connected layer to obtain the global semantic representation sequence Z corresponding to the words.
Z = [head_1; head_2; …; head_M] W_z (10)
where W_z is a model parameter.
For example, through concatenation and the fully connected layer, the global semantic representation sequence Z is obtained.
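A minimal sketch of this global semantic encoding (equations (8) to (10)) is given below; it follows the standard multi-head dot-product self-attention computation, with d_model and the number of heads chosen as assumptions.

```python
# Sketch of step 2: multi-head dot-product self-attention over H (assumed sizes).
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=200, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_k = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)   # stacks the per-head W_i^Q
        self.w_k = nn.Linear(d_model, d_model)   # stacks the per-head W_i^K
        self.w_v = nn.Linear(d_model, d_model)   # stacks the per-head W_i^V
        self.w_z = nn.Linear(d_model, d_model)   # W_z applied after concatenation

    def forward(self, H):                        # H: (batch, N, d_model)
        B, N, _ = H.shape
        def split(x):                            # -> (batch, heads, N, d_k)
            return x.view(B, N, self.num_heads, self.d_k).transpose(1, 2)
        Q, K, V = split(self.w_q(H)), split(self.w_k(H)), split(self.w_v(H))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)   # Q_i K_i^T / sqrt(d_k)
        heads = torch.softmax(scores, dim=-1) @ V                # Attention(Q_i, K_i, V_i)
        Z = heads.transpose(1, 2).reshape(B, N, -1)              # [head_1; ...; head_M]
        return self.w_z(Z)                                       # global semantics Z
```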
Step 3) Semantic feature fusion: although the attention mechanism is not limited by distance when modeling semantic or syntactic dependencies and can compensate for the weakness of BLSTM in long-distance semantic modeling, it is an order-insensitive computing mechanism, so sequential context may be lost during modeling. Therefore, this step constructs three feature fusion modes to fuse the local semantic features H learned by the BLSTM in step 1) with the global semantic features Z learned by the multi-head self-attention mechanism in step 2), so that their advantages complement each other, and the fusion result U is used as the input semantic feature of step 4).
Step 3.1) One-dimensional parameter fusion method: for the i-th word in the text, the corresponding local context semantic representation h_i and global semantic representation z_i are first concatenated, mapped to a one-dimensional space by a fully connected layer, and passed through a sigmoid activation function to obtain the fusion weight β_i:
β_i = sigmoid(W_β[h_i; z_i]) (11)
Semantic representation after one-dimensional parameter fusion: u_i = (1 - β_i)·h_i + β_i·z_i (12)
where W_β is a model parameter.
For example, the local context semantic representation of the 4th word "Tianjin" in the word sequence is h_4 = [0.02, 0.11, …, 0.76] and its global semantic representation is z_4 = [0.14, 0.09, …, -0.26]. The calculation gives β_4 = 0.4, so the fused semantic representation is u_4 = [0.07, 0.10, …, 0.35].
Step 3.2) Multi-dimensional parameter fusion method: drawing on the gating mechanism of the LSTM, for the i-th word in the text, the corresponding local semantic representation h_i and global semantic representation z_i are first concatenated, mapped by a fully connected layer to a weight space with the same dimension as the semantic representation, and passed through a sigmoid activation function to obtain the fusion weight vector α_i:
α_i = sigmoid(W_α[h_i; z_i]) (13)
where W_α is a model parameter. The local semantics and the global semantics are then fused by element-wise multiplication:
Semantic representation after multi-dimensional parameter fusion: u_i = (I - α_i) ⊙ h_i + α_i ⊙ z_i (14)
where ⊙ denotes element-wise multiplication and I denotes a column vector whose elements are all 1.
For example, the local context semantic representation of the 4th word "Tianjin" in the word sequence is h_4 = [0.02, 0.11, …, 0.76] and its global semantic representation is z_4 = [0.14, 0.09, …, -0.26]. The calculation gives α_4 = [0.31, 0.1, …, 0.4], so the fused semantic representation is u_4 = [0.06, 0.11, …, 0.35].
Step 3.3) Free-weight semantic fusion method: specifically, two trainable parameters γ and δ are randomly initialized in this step, and semantic feature fusion is performed with them:
u_i = γ_i ⊙ h_i + δ_i ⊙ z_i (15)
For example, the local context semantic representation of the 4th word "Tianjin" in the word sequence is h_4 = [0.02, 0.11, …, 0.76] and its global semantic representation is z_4 = [0.14, 0.09, …, -0.26]. After model optimization, γ_4 = [0.19, 0.52, …, -0.11] and δ_4 = [-0.22, 0.98, …, 0.17], so the fused semantic representation is u_4 = [-0.03, 0.15, …, 0.13].
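The three fusion methods of equations (11) to (15) can be sketched in one module as follows; the feature dimension and the mode names are assumptions made for illustration.

```python
# Sketch of step 3: the three feature fusion modes (assumed dimension and mode names).
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, d_model=200, mode="one_dim"):
        super().__init__()
        self.mode = mode
        self.w_beta = nn.Linear(2 * d_model, 1)          # W_beta (one-dimensional fusion)
        self.w_alpha = nn.Linear(2 * d_model, d_model)   # W_alpha (multi-dimensional fusion)
        self.gamma = nn.Parameter(torch.ones(d_model))   # free-weight fusion parameters
        self.delta = nn.Parameter(torch.ones(d_model))

    def forward(self, h, z):                             # both (batch, N, d_model)
        if self.mode == "one_dim":                       # u = (1 - beta)*h + beta*z
            beta = torch.sigmoid(self.w_beta(torch.cat([h, z], dim=-1)))
            return (1 - beta) * h + beta * z
        if self.mode == "multi_dim":                     # u = (I - alpha) ⊙ h + alpha ⊙ z
            alpha = torch.sigmoid(self.w_alpha(torch.cat([h, z], dim=-1)))
            return (1 - alpha) * h + alpha * z
        return self.gamma * h + self.delta * z           # u = gamma ⊙ h + delta ⊙ z
```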
Step 4), sequence labeling: in order to fully consider the dependency relationship among the labels in the sequence labeling task, the CRF is utilized to predict the labels.
Step 4.1) The fused semantic feature sequence U = {u_1, u_2, …, u_N} obtained in step 3) (in practical applications, one fusion mode is selected from step 3) and its result is used as the input of step 4)) is passed through a fully connected transformation to obtain the state feature matrix P, which represents the association between the semantics of each word and the labels:
P = U W_p + b_p (16)
where W_p and b_p are model parameters.
For example, for the 4th word "Tianjin" in the word sequence, its state feature may be p_4 = [0.01, 0.91, …, 0.00].
Step 4.2) A transfer feature matrix A is randomly initialized to represent the transfer relationships between labels; this matrix is optimized through loss backpropagation during model training.
Step 4.3) Based on the state feature matrix P obtained in step 4.1) and the transfer feature matrix A obtained in step 4.2), the score corresponding to any possible label sequence Y' = {y'_1, y'_2, …, y'_N} is calculated as:
s(X, Y') = Σ_{i=1..N} (A_{y'_{i-1}, y'_i} + P_{i, y'_i})
Based on the score, the probability corresponding to the label sequence is calculated:
P(Y' | X) = exp(s(X, Y')) / Σ_{Y''} exp(s(X, Y''))
For example, for the named entity recognition task, the probability corresponding to the standard tag sequence of the word sequence in the above example is 0.9.
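For illustration, the score of a single tag sequence can be computed from P and A as sketched below; the extra start-tag row of A is an assumption added to handle the transition into the first position.

```python
# Sketch of the sequence score s(X, Y') = sum_i (A[y'_{i-1}, y'_i] + P[i, y'_i]).
# A is assumed to have one extra row indexed by start_tag for the first transition.
import torch

def sequence_score(P, A, tags, start_tag):
    # P: (N, num_tags) state features; A: (num_tags + 1, num_tags) transfer features
    # tags: list of N tag indices (a gold or candidate label sequence)
    score = torch.tensor(0.0)
    prev = start_tag
    for i, t in enumerate(tags):
        score = score + A[prev, t] + P[i, t]   # transfer score + state score
        prev = t
    return score
```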
Step 5) Model training: in the model training process, the invention maximizes the probability P(Y | X) corresponding to the standard label sequence. Therefore, the parameters in steps 1) to 4) are optimized by minimizing the following negative log-likelihood function:
L = -log P(Y | X)
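The negative log-likelihood can be sketched with the standard forward algorithm in log space, reusing the sequence_score sketch above; this is a generic CRF loss outline under the same shape assumptions, not the invention's exact training code.

```python
# Sketch of step 5: -log P(Y|X) = log Z(X) - s(X, Y), with log Z(X) from the forward algorithm.
import torch

def crf_negative_log_likelihood(P, A, gold_tags, start_tag):
    num_tags = P.size(1)
    alpha = A[start_tag] + P[0]                # scores of length-1 prefixes ending in each tag
    for i in range(1, P.size(0)):
        # alpha_new[j] = logsumexp_k(alpha[k] + A[k, j]) + P[i, j]
        alpha = torch.logsumexp(alpha.unsqueeze(1) + A[:num_tags], dim=0) + P[i]
    log_Z = torch.logsumexp(alpha, dim=0)      # log of the sum over all label sequences
    return log_Z - sequence_score(P, A, gold_tags, start_tag)
```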
step 6) model reasoning: in the practical application process, the invention adopts the Viterbi algorithm to search the optimal label sequence:
the foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.
Claims (10)
1. A sequence labeling method based on a multi-head self-attention mechanism is characterized by comprising the following steps of sequentially executing:
step 1, local context semantic coding, namely learning local context semantic representation of words in a text in a BLSTM serialization manner:
step 1.1, performing word segmentation on an input text to obtain a corresponding word sequence;
step 1.2, for each word in the word sequence, coding character-level vector representation corresponding to each word by using a BLSTM structure;
step 1.3, for each word in the word sequence, splicing the character-level vector representation and the word embedding vector representation coded in the step 1.2 to serve as word initial semantic representation;
step 1.4, based on the word initial semantic representation obtained in step 1.3, using BLSTM to encode the local context semantic representation of each word;
step 2, global semantic coding, namely coding the global semantic representation of the words by utilizing a multi-head self-attention mechanism based on the local context semantic representation of the words coded in the step 1:
step 2.1, mapping the local context semantic representation of the words coded in step 1 to a plurality of different feature subspaces by adopting a full connection layer;
step 2.2, under different feature subspaces obtained in step 2.1, utilizing a self-attention mechanism to encode semantic representation of words;
step 2.3, the semantic representations of the words in each feature subspace calculated in the step 2.2 are spliced, and the splicing result is input into a full connection layer to obtain the global semantic representation corresponding to each word;
and 3, semantic feature fusion, namely constructing the following three feature fusion modes, fusing the local context semantic representation coded in the step 1 and the global semantic representation coded in the step 2, and taking a fusion result as an input semantic feature in the step 4:
step 3.1, constructing a one-dimensional parameter fusion method to realize the linear combination of local context semantics and global semantics;
step 3.2, building a multi-dimensional parameter fusion method by using a gating mechanism adopted in the LSTM for reference;
3.3, constructing a free weight semantic fusion method;
and 4, carrying out sequence labeling, namely predicting the labels by using a CRF (conditional random field) in order to fully consider the dependency relationships among the labels in the sequence labeling task:
step 4.1, performing full-connection transformation on the fused semantic feature sequence obtained in the step 3 to obtain a state feature matrix, and representing the association between the semantics of each word and the label;
step 4.2, initializing a transfer characteristic matrix randomly to express the transfer relation between the labels;
step 4.3, calculating the corresponding score and probability of any possible label sequence based on the state characteristic matrix obtained in the step 4.1 and the transfer characteristic matrix obtained in the step 4.2;
step 5, model training: in the model training process, optimizing the parameters in the steps 1 to 4 by adopting the probability corresponding to the maximized standard label sequence;
step 6, model reasoning: in the practical application process, the optimal label sequence is searched by adopting a Viterbi algorithm, and model reasoning is carried out.
2. The method for labeling sequences based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 1.1, the Stanford NLP toolkit is used to perform word segmentation on the input text.
4. The method for labeling sequences based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 1.4, based on the word initial semantic representation sequence E = {e_1, e_2, …, e_N} obtained in step 1.3, BLSTM is used to encode the local context semantic representation h_i of each word x_i in the text.
5. The method for labeling sequences based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 2.1, the word local context semantic representation sequence H = {h_1, h_2, …, h_N} encoded in step 1 is mapped to M different feature subspaces, where the mapping for the i-th feature subspace is:
Q_i = H W_i^Q, K_i = H W_i^K, V_i = H W_i^V
where W_i^Q, W_i^K and W_i^V are model parameters.
6. The method for labeling sequences based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 2.2, the semantic representation of the words is encoded by a dot-product self-attention mechanism in the different feature subspaces obtained in step 2.1:
head_i = Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_k) V_i
where d_k represents the feature dimension of the subspace and T represents the matrix transpose operation.
7. The method for labeling sequences based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 2.3, the semantic representations head_i in each feature subspace calculated in step 2.2 are concatenated, and the concatenation result is fed into a fully connected layer to obtain the global semantic representation sequence Z corresponding to the words:
Z = [head_1; head_2; …; head_M] W_z
where W_z is a model parameter.
8. The method for labeling sequences based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 3, the semantic representation after one-dimensional parameter fusion is: u_i = (1 - β_i)·h_i + β_i·z_i,
where β_i = sigmoid(W_β[h_i; z_i]), h_i is the local context semantic representation, z_i is the global semantic representation, and W_β is a model parameter;
the semantic representation after multi-dimensional parameter fusion is: u_i = (I - α_i) ⊙ h_i + α_i ⊙ z_i,
where α_i = sigmoid(W_α[h_i; z_i]) and W_α is a model parameter;
the semantic representation after free-weight semantic fusion is: u_i = γ_i ⊙ h_i + δ_i ⊙ z_i,
where γ and δ are two trainable parameters.
9. The method for labeling sequences based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 4.1, the fused semantic feature sequence U = {u_1, u_2, …, u_N} obtained in step 3 is passed through a fully connected transformation to obtain the state feature matrix P, which represents the association between the semantics of each word and the labels:
P = U W_p + b_p
where W_p and b_p are model parameters.
10. The method for labeling sequences based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 4.3, based on the state feature matrix P obtained in step 4.1 and the transition feature matrix A obtained in step 4.2, the score corresponding to any possible label sequence Y' = {y'_1, y'_2, …, y'_N} is calculated as:
s(X, Y') = Σ_{i=1..N} (A_{y'_{i-1}, y'_i} + P_{i, y'_i})
Based on the score, the probability corresponding to the label sequence is calculated:
P(Y' | X) = exp(s(X, Y')) / Σ_{Y''} exp(s(X, Y''))
In the model training process, the probability P(Y | X) corresponding to the standard label sequence is maximized, and the parameters in steps 1 to 4 are optimized by minimizing the following negative log-likelihood function:
L = -log P(Y | X)
In step 6, the Viterbi algorithm is used to search for the optimal label sequence:
Y* = argmax_{Y'} s(X, Y')
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011187198.0A CN112380863A (en) | 2020-10-29 | 2020-10-29 | Sequence labeling method based on multi-head self-attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011187198.0A CN112380863A (en) | 2020-10-29 | 2020-10-29 | Sequence labeling method based on multi-head self-attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112380863A true CN112380863A (en) | 2021-02-19 |
Family
ID=74576393
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011187198.0A Pending CN112380863A (en) | 2020-10-29 | 2020-10-29 | Sequence labeling method based on multi-head self-attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112380863A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109697285A (en) * | 2018-12-13 | 2019-04-30 | 中南大学 | Enhance the hierarchical B iLSTM Chinese electronic health record disease code mask method of semantic expressiveness |
CN110457480A (en) * | 2019-08-16 | 2019-11-15 | 国网天津市电力公司 | The construction method of fine granularity sentiment classification model based on interactive attention mechanism |
CN111274398A (en) * | 2020-01-20 | 2020-06-12 | 福州大学 | Method and system for analyzing comment emotion of aspect-level user product |
CN111767409A (en) * | 2020-06-14 | 2020-10-13 | 南开大学 | Entity relationship extraction method based on multi-head self-attention mechanism |
CN111783394A (en) * | 2020-08-11 | 2020-10-16 | 深圳市北科瑞声科技股份有限公司 | Training method of event extraction model, event extraction method, system and equipment |
Non-Patent Citations (2)
Title |
---|
Zhang Zhichang et al., "Health Question Classification Fusing Local Semantic and Global Structural Information", Journal of Xidian University *
Wang Xuqiang et al., "Feature Fusion Sequence Labeling Model Based on Attention Mechanism", HTTPS://KNS.CNKI.NET/KCMS/DETAIL/37.1357.N.20200619.1603.002.HTML *
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113010685B (en) * | 2021-02-23 | 2022-12-06 | 安徽讯飞医疗股份有限公司 | Medical term standardization method, electronic device, and storage medium |
CN113010685A (en) * | 2021-02-23 | 2021-06-22 | 安徽科大讯飞医疗信息技术有限公司 | Medical term standardization method, electronic device, and storage medium |
CN112990434A (en) * | 2021-03-09 | 2021-06-18 | 平安科技(深圳)有限公司 | Training method of machine translation model and related device |
CN112990434B (en) * | 2021-03-09 | 2023-06-20 | 平安科技(深圳)有限公司 | Training method of machine translation model and related device |
CN112967112B (en) * | 2021-03-24 | 2022-04-29 | 武汉大学 | Electronic commerce recommendation method for self-attention mechanism and graph neural network |
CN112967112A (en) * | 2021-03-24 | 2021-06-15 | 武汉大学 | Electronic commerce recommendation method for self-attention mechanism and graph neural network |
CN113158051B (en) * | 2021-04-23 | 2022-11-18 | 山东大学 | Label sorting method based on information propagation and multilayer context information modeling |
CN113158051A (en) * | 2021-04-23 | 2021-07-23 | 山东大学 | Label sorting method based on information propagation and multilayer context information modeling |
CN113240098A (en) * | 2021-06-16 | 2021-08-10 | 湖北工业大学 | Fault prediction method and device based on hybrid gated neural network and storage medium |
CN113378243A (en) * | 2021-07-14 | 2021-09-10 | 南京信息工程大学 | Personalized federal learning method based on multi-head attention mechanism |
CN113378243B (en) * | 2021-07-14 | 2023-09-29 | 南京信息工程大学 | Personalized federal learning method based on multi-head attention mechanism |
CN114462406A (en) * | 2022-03-01 | 2022-05-10 | 中国航空综合技术研究所 | Method for acquiring first-appearing aviation keywords based on multi-head self-attention model |
CN115796173A (en) * | 2023-02-20 | 2023-03-14 | 杭银消费金融股份有限公司 | Data processing method and system for supervision submission requirements |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112380863A (en) | Sequence labeling method based on multi-head self-attention mechanism | |
CN109992782B (en) | Legal document named entity identification method and device and computer equipment | |
CN108416058B (en) | Bi-LSTM input information enhancement-based relation extraction method | |
CN106502985B (en) | neural network modeling method and device for generating titles | |
CN111666427B (en) | Entity relationship joint extraction method, device, equipment and medium | |
CN109062901B (en) | Neural network training method and device and name entity recognition method and device | |
CN110232192A (en) | Electric power term names entity recognition method and device | |
CN112541356B (en) | Method and system for recognizing biomedical named entities | |
CN111767409A (en) | Entity relationship extraction method based on multi-head self-attention mechanism | |
CN113987169A (en) | Text abstract generation method, device and equipment based on semantic block and storage medium | |
CN111881256B (en) | Text entity relation extraction method and device and computer readable storage medium equipment | |
CN111368542A (en) | Text language association extraction method and system based on recurrent neural network | |
CN114298053A (en) | Event joint extraction system based on feature and attention mechanism fusion | |
CN113326367B (en) | Task type dialogue method and system based on end-to-end text generation | |
CN110874536A (en) | Corpus quality evaluation model generation method and bilingual sentence pair inter-translation quality evaluation method | |
CN111737497B (en) | Weak supervision relation extraction method based on multi-source semantic representation fusion | |
CN116932722A (en) | Cross-modal data fusion-based medical visual question-answering method and system | |
CN112163089A (en) | Military high-technology text classification method and system fusing named entity recognition | |
Xu et al. | Match-prompt: Improving multi-task generalization ability for neural text matching via prompt learning | |
CN115169349A (en) | Chinese electronic resume named entity recognition method based on ALBERT | |
CN116341564A (en) | Problem reasoning method and device based on semantic understanding | |
CN117610562B (en) | Relation extraction method combining combined category grammar and multi-task learning | |
CN113076718B (en) | Commodity attribute extraction method and system | |
CN115019142A (en) | Image title generation method and system based on fusion features and electronic equipment | |
CN116680575B (en) | Model processing method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210219 |
RJ01 | Rejection of invention patent application after publication | |