CN112380863A - Sequence labeling method based on multi-head self-attention mechanism - Google Patents
- Publication number
- CN112380863A CN112380863A CN202011187198.0A CN202011187198A CN112380863A CN 112380863 A CN112380863 A CN 112380863A CN 202011187198 A CN202011187198 A CN 202011187198A CN 112380863 A CN112380863 A CN 112380863A
- Authority
- CN
- China
- Prior art keywords
- word
- sequence
- semantic
- semantic representation
- attention mechanism
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Databases & Information Systems (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a sequence labeling method based on a multi-head self-attention mechanism, which comprises the following steps: step 1, local context semantic coding, namely using a BLSTM (bidirectional long short-term memory network) to sequentially learn the local context semantic representation of words in a text; step 2, global semantic coding, namely encoding the global semantic representation of the words with a multi-head self-attention mechanism based on the local context semantic representation encoded in step 1; step 3, semantic feature fusion, namely fusing the local context semantic representation encoded in step 1 with the global semantic representation encoded in step 2 and taking the fusion result as the input semantic feature of step 4; step 4, sequence labeling, namely predicting labels with a CRF (conditional random field) in order to fully consider the dependency relationships among labels in the sequence labeling task; step 5, model training; and step 6, model inference. The invention further introduces a multi-head self-attention mechanism on the basis of the recurrent neural network to learn the global semantic representation of words, and improves the sequence labeling effect.
Description
Technical Field
The invention relates to the technical field of computer applications, and in particular to a sequence labeling method based on a multi-head self-attention mechanism.
Background
Sequence labeling is an important research topic in natural language processing. It aims to predict a corresponding tag sequence for a given text sequence, and mainly comprises tasks such as Named Entity Recognition (NER), Text Chunking, Part-of-Speech tagging (POS), and Opinion Extraction.
Early sequence labeling methods were mostly rule-based, requiring rule templates and a large amount of expert knowledge, consuming considerable manpower and material resources, and being difficult to extend and transplant to other fields. For example, Wang Ning et al. manually established a knowledge base for financial company name recognition in a rule-based manner. Toral and Muñoz automatically built and maintained gazetteers (lists of person names, organizations, places and other entities) for entity recognition based on the analysis of online Wikipedia. Zizhenning et al. constructed and customized a named entity recognition tagger which, although field-adaptive and achieving good experimental results, was still manually built and time-consuming.
Due to the shortcomings of rule-based methods, machine learning models based on statistical learning have been increasingly applied to sequence labeling, such as Support Vector Machines (SVM), Hidden Markov Models (HMM), Conditional Random Fields (CRF), and Maximum Entropy models (ME). For example, Mayfield et al. used an SVM to capture hundreds of features from the training data. Zhou and Su proposed an HMM-based named entity recognition system that can apply and fuse simple word features (such as capitalization and digits). McCallum and Li applied CRFs to named entity recognition and performed well on multiple datasets. Liu et al. applied the ME model to named entity recognition while incorporating a method that fuses local and global features within sentences. Although methods based on statistical learning models achieve better performance, they still depend heavily on hand-crafted features and can only capture local features.
In recent years, with the rapid development of deep learning, its strong learning and automatic feature extraction capabilities have brought success in natural language processing, and deep learning is therefore widely used in many sequence labeling tasks. For example, Zhang Miao et al. applied the BLSTM-CRF framework to sequence labeling and achieved highly competitive performance, because the BLSTM effectively utilizes context features and the CRF models sentence-level tag information. Chiu proposed a novel BLSTM-CNN model that obtains character features through a CNN, concatenates them with word embeddings and feeds them into a BLSTM; although effective, it relies on dictionary or vocabulary features. Recently, the attention mechanism has been increasingly applied to sequence labeling tasks. Compared with an LSTM or CNN, the attention mechanism is not affected by distance when modeling semantic dependencies. For example, Rei et al. combined an attention mechanism with the BLSTM-CRF framework to learn weight coefficients and fed the weighted sum of the two features into the CRF for label prediction. Luo et al. demonstrated that introducing the attention mechanism into BLSTM-CRF can improve chemical and drug entity recognition, improve labeling consistency at the document level, and enrich context information at the sentence level. Tan et al. proposed a deep attention network for sequence labeling, using an N-layer deep model in which each layer contains a non-linear layer and a self-attention layer, with the output of the highest layer fed into a softmax layer. Although existing deep-learning-based methods achieve better performance, they still suffer from drawbacks such as local dependency and inaccurate position information.
In summary, most existing sequence labeling methods are built on the LSTM-CRF framework, but using an LSTM as the encoder to learn the context semantic representation of words in a text usually suffers from two problems. First, sequence labeling models based on recurrent neural networks usually exhibit local dependency, so semantic information at long distances is lost, and the longer the distance between two words, the more pronounced this problem becomes. Second, sequence labeling models based on recurrent neural networks are limited to serialized feature learning and therefore cannot flexibly model the semantic relationship between any two words in the text.
Disclosure of Invention
The invention aims to provide a sequence labeling method based on a multi-head self-attention mechanism, addressing the problems of local dependency and serialized coding in prior-art sequence labeling methods.
the technical scheme adopted for realizing the purpose of the invention is as follows:
a sequence labeling method based on a multi-head self-attention mechanism comprises the following steps executed in sequence:
step 1, local context semantic coding, namely learning local context semantic representation of words in a text in a BLSTM serialization manner:
step 1.1, performing word segmentation on an input text to obtain a corresponding word sequence;
step 1.2, for each word in the word sequence, coding character-level vector representation corresponding to each word by using a BLSTM structure;
step 1.3, for each word in the word sequence, splicing the character-level vector representation and the word embedding vector representation coded in the step 1.2 to serve as word initial semantic representation;
step 1.4, based on the word initial semantic representation obtained in step 1.3, using BLSTM to encode the local context semantic representation of each word;
step 2, global semantic coding, namely coding the global semantic representation of the words by utilizing a multi-head self-attention mechanism based on the local context semantic representation of the words coded in the step 1:
step 2.1, mapping the local context semantic representation of the words coded in step 1 to a plurality of different feature subspaces by adopting a full connection layer;
step 2.2, under different feature subspaces obtained in step 2.1, utilizing a self-attention mechanism to encode semantic representation of words;
step 2.3, the semantic representations of the words in each feature subspace calculated in the step 2.2 are spliced, and the splicing result is input into a full connection layer to obtain the global semantic representation corresponding to each word;
and 3, semantic feature fusion, namely constructing the following three feature fusion modes, fusing the local context semantic representation coded in the step 1 and the global semantic representation coded in the step 2, and taking a fusion result as an input semantic feature in the step 4:
step 3.1, constructing a one-dimensional parameter fusion method to realize the linear combination of local context semantics and global semantics;
step 3.2, building a multi-dimensional parameter fusion method by using a gating mechanism adopted in the LSTM for reference;
3.3, constructing a free weight semantic fusion method;
and 4, carrying out sequence labeling, namely predicting the labels by using a CRF (conditional random field) in order to fully consider the dependency relationships among the labels in the sequence labeling task:
step 4.1, performing full-connection transformation on the fused semantic feature sequence obtained in the step 3 to obtain a state feature matrix, and representing the association between the semantics of each word and the label;
step 4.2, initializing a transfer characteristic matrix randomly to express the transfer relation between the labels;
step 4.3, calculating the corresponding score and probability of any possible label sequence based on the state characteristic matrix obtained in the step 4.1 and the transfer characteristic matrix obtained in the step 4.2;
step 5, model training: in the model training process, optimizing the parameters in the steps 1 to 4 by adopting the probability corresponding to the maximized standard label sequence;
step 6, model reasoning: in the practical application process, the optimal label sequence is searched by adopting a Viterbi algorithm, and model reasoning is carried out.
In the above technical solution, in the step 1.1, a Stanford NLP toolkit is used to perform word segmentation on an input text.
In the above technical solution, in step 1.3, the initial semantic representation of the word is e_i = [c_i; w_i], where c_i is the character-level vector representation and w_i is the word embedding vector representation.
In the above technical solution, in step 1.4, based on the word initial semantic representation sequence E = {e_1, e_2, …, e_N} obtained in step 1.3, BLSTM is used to encode the local context semantic representation h_i of each word x_i in the text.
In the above technical solution, in step 2.1, the word local context semantic representation sequence H = {h_1, h_2, …, h_N} encoded in step 1 is mapped to M different feature subspaces, where the mapping for the i-th feature subspace is:
Q_i = H W_i^Q, K_i = H W_i^K, V_i = H W_i^V
where W_i^Q, W_i^K and W_i^V are model parameters; Q represents the query in the attention mechanism, K represents the key, and V represents the value corresponding to the key.
In the above technical solution, in step 2.2, in the different feature subspaces obtained in step 2.1, a dot-product self-attention mechanism is used to encode the semantic representation of the words:
head_i = Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_k) V_i
where d_k represents the feature dimension of the subspace and T represents the matrix transpose operation.
In the above technical solution, in step 2.3, the semantic representations head_i in each feature subspace calculated in step 2.2 are concatenated, and the concatenation result is fed into a fully connected layer to obtain the global semantic representation sequence Z corresponding to the words:
Z = [head_1; head_2; …; head_M] W_z
where W_z is a model parameter.
In the above technical solution, in step 3, the semantic representation after one-dimensional parameter fusion is: u_i = (1 - β_i)·h_i + β_i·z_i,
where β_i = sigmoid(W_β[h_i; z_i]), h_i is the local context semantic representation, z_i is the global semantic representation, and W_β is a model parameter;
the semantic representation after multi-dimensional parameter fusion is: u_i = (I - α_i) ⊙ h_i + α_i ⊙ z_i,
where α_i = sigmoid(W_α[h_i; z_i]) and W_α is a model parameter;
the semantic representation after free-weight semantic fusion is: u_i = γ_i ⊙ h_i + δ_i ⊙ z_i,
where γ and δ are two trainable parameters.
In the above technical solution, in step 4.1, the fused semantic feature sequence U = {u_1, u_2, …, u_N} obtained in step 3 is passed through a fully connected transformation to obtain the state feature matrix P, which represents the association between the semantics of each word and the labels:
P = U W_p + b_p
where W_p and b_p are model parameters.
In the above technical solution, in step 4.3, based on the state feature matrix P obtained in step 4.1 and the transfer feature matrix A obtained in step 4.2, the score corresponding to any possible label sequence Y' = {y'_1, y'_2, …, y'_N} is calculated as:
s(X, Y') = Σ_{i=1..N} (A_{y'_{i-1}, y'_i} + P_{i, y'_i})
Based on the score, the probability corresponding to the label sequence is calculated:
P(Y' | X) = exp(s(X, Y')) / Σ_{Y''} exp(s(X, Y''))
In the model training process, the probability P(Y | X) corresponding to the standard label sequence is maximized, and the parameters in steps 1 to 4 are optimized by minimizing the following negative log-likelihood function:
L = -log P(Y | X)
In step 6, the Viterbi algorithm is used to search for the optimal label sequence:
Y* = argmax_{Y'} s(X, Y')
compared with the prior art, the invention has the beneficial effects that:
1. The invention further introduces a multi-head self-attention mechanism on the basis of the recurrent neural network to learn the global semantic representation of words, improves the sequence labeling effect, and effectively alleviates the problems of local dependency and serialized coding caused by encoding with a recurrent neural network.
2. The local context semantics encoded by the recurrent neural network comprehensively consider the short-distance semantics of words and the word-order relationships between words, while the global semantics encoded by the multi-head self-attention mechanism are not limited by distance when modeling semantics, which compensates for the recurrent neural network's weakness in long-distance semantic modeling but lacks word-order modeling. The local semantics and the global semantics are therefore complementary to a certain extent. The invention considers both kinds of semantics, constructs three semantic feature fusion methods to fuse the local semantic features learned by the BLSTM with the global semantic features learned by the multi-head self-attention mechanism so that their advantages complement each other, takes the fusion result as the input semantic features, and improves the sequence labeling effect.
Drawings
FIG. 1 is a schematic diagram of the overall structure of the present invention.
FIG. 2 is a schematic diagram of a sequence labeling method based on a multi-head self-attention mechanism.
Detailed Description
The present invention will be described in further detail with reference to specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example 1
The invention first learns the context semantic features of words in a text with a bidirectional long short-term memory network (BLSTM). Then, based on the hidden representations learned by the BLSTM, a multi-head self-attention mechanism is adopted to model the semantic relationship between any two words in the text, thereby obtaining the global semantics that each word should attend to. To fully exploit the complementarity of local context semantics and global semantics, the invention designs three feature fusion modes to fuse the two kinds of semantics, and uses a conditional random field (CRF) model to predict the label sequence based on the fused features.
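As a purely illustrative reference, the following is a minimal PyTorch sketch of this overall pipeline (character and word embeddings, BLSTM encoder, multi-head self-attention, one-dimensional fusion, CRF state and transfer features). The module name, dimensions, head count and tag-set size are assumptions made for the sketch, not the exact implementation of the invention.

```python
# Minimal sketch of the overall pipeline (assumed dimensions, not the patented code).
import torch
import torch.nn as nn

class SequenceLabeler(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, char_dim=50, hidden_dim=200,
                 num_heads=8, num_tags=9):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        # step 1: word-level BLSTM over [character vector; word embedding]
        self.word_blstm = nn.LSTM(emb_dim + char_dim, hidden_dim // 2,
                                  batch_first=True, bidirectional=True)
        # step 2: multi-head self-attention for global semantics
        self.self_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        # step 3.1: one-dimensional parameter fusion weight
        self.fuse = nn.Linear(2 * hidden_dim, 1)
        # step 4.1: state features (emissions) for the CRF
        self.emission = nn.Linear(hidden_dim, num_tags)
        # step 4.2: randomly initialized transfer feature matrix A
        self.transitions = nn.Parameter(torch.randn(num_tags, num_tags))

    def forward(self, word_ids, char_repr):
        e = torch.cat([self.word_emb(word_ids), char_repr], dim=-1)   # e_i = [c_i; w_i]
        h, _ = self.word_blstm(e)                                     # local context h_i
        z, _ = self.self_attn(h, h, h)                                # global semantics z_i
        beta = torch.sigmoid(self.fuse(torch.cat([h, z], dim=-1)))    # fusion weight beta_i
        u = (1 - beta) * h + beta * z                                 # fused features u_i
        return self.emission(u)                                       # state feature matrix P
```

A CRF layer, sketched separately in Example 3 below, would then score candidate tag sequences from the returned state features and the transfer matrix.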
Example 2
The invention mainly adopts deep learning technology and theoretical methods of natural language processing to realize the sequence labeling task. To ensure normal operation of the system, the computer platform used in the specific implementation should have at least 8 GB of memory, a CPU with at least 4 cores and a main frequency of at least 2.6 GHz, a GPU environment, and a Linux operating system, with the necessary software environment installed, such as Python 3.6 or above and PyTorch 0.4 or above.
As shown in fig. 1, the sequence labeling method based on the multi-head self-attention mechanism provided by the present invention mainly comprises the following steps executed in sequence:
step 1, local context semantic coding: the local context semantic representation of words in text is learned sequentially using a bidirectional long-short term memory network (BLSTM).
Step 1.1) the Stanford NLP toolkit is adopted to perform word segmentation on the input text to obtain a corresponding word sequence.
Step 1.2) for each word in the word sequence, encoding character-level vector representation corresponding to each word by using a Bidirectional LSTM (BLSTM) structure.
Step 1.3) for each word in the text, splicing the character-level vector representation encoded in step 1.2) with the word-embedded vector representation as the initial semantic representation of the word.
Step 1.4) Using BLSTM to encode the local context semantic representation of each word in the text: the input is the initial semantic representation of the words obtained in step 1.3), and the output is the local context semantic representation of each word.
Step 2, global semantic coding: encoding a global semantic representation of the word using a multi-headed self-attention mechanism based on the local context semantic representation of the word encoded in step 1).
Step 2.1) mapping the local context semantic representation of the word encoded in step 1 to a plurality of different feature subspaces using a full connectivity layer.
Step 2.2) utilizing a self-attention mechanism to encode semantic representation of the words under different feature subspaces obtained in the step 2.1).
Step 2.3) Concatenating the semantic representations in each feature subspace calculated in step 2.2), and feeding the concatenation result into a fully connected layer to obtain the global semantic representation corresponding to each word.
Step 3, semantic feature fusion: constructing the following three feature fusion modes, fusing the local semantic representation encoded in step 1) with the global semantic representation encoded in step 2), and taking the fusion result as the input semantic feature of step 4.
Step 3.1) Constructing a one-dimensional parameter fusion method to realize the linear combination of local and global semantics.
Step 3.2) Constructing a multi-dimensional parameter fusion method drawing on the gating mechanism used in the LSTM.
Step 3.3) Constructing a free-weight semantic fusion method.
Step 4, sequence labeling: in order to fully consider the dependency relationships among the labels in the sequence labeling task, a CRF is used to predict the labels.
Step 4.1) Performing a fully connected transformation on the fused semantic feature sequence obtained in step 3 to obtain a state feature matrix representing the association between the semantics of each word and the labels.
Step 4.2) Randomly initializing a transfer feature matrix to represent the transfer relationships between labels.
Step 4.3) Calculating the score and probability corresponding to any possible label sequence based on the state feature matrix obtained in step 4.1) and the transfer feature matrix obtained in step 4.2).
Step 5, model training: in the model training process, the parameters in the steps 1 to 4 are optimized by adopting the probability corresponding to the maximized standard label sequence.
Step 6, model reasoning: in the practical application process, the optimal label sequence is searched by adopting a Viterbi algorithm, and model reasoning is carried out.
Example 3
The sequence labeling method based on the multi-head self-attention mechanism mainly comprises the following steps executed in sequence:
step 1, local context semantic coding: the local context semantic representation of words in text is learned sequentially using a bidirectional long-short term memory network (BLSTM).
Step 1.1) The Stanford NLP toolkit is used to perform word segmentation on the input text to obtain the corresponding word sequence X = {x_1, x_2, …, x_N}.
For example, given the text "I participated in a marathon race in Tianjin yesterday", the word sequence {"I", "yesterday", "in", "Tianjin", "participating in", "having", "a", "marathon", "race"} is obtained after word segmentation (the tokens follow the segmentation of the original Chinese sentence).
Step 1.2) Considering that words in text usually contain rich morphological features, such as prefix and suffix information, this step uses a bidirectional LSTM (BLSTM) structure to encode, for each word x_i in the word sequence, the corresponding character-level vector representation c_i, where c_{i,j} denotes the j-th character of the i-th word in the text.
For example, for the 4th word "Tianjin" in the word sequence, its 1st character is "Tian" and its 2nd character is "jin". Through BLSTM encoding, the character-level vector representation c_4 of "Tianjin" is obtained.
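A minimal sketch of this character-level encoding step is given below; the embedding and hidden sizes are assumptions chosen for illustration only.

```python
# Sketch of step 1.2: a character-level BLSTM producing c_i for each word (assumed sizes).
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    def __init__(self, num_chars, char_emb_dim=30, char_hidden=25):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_emb_dim)
        self.char_blstm = nn.LSTM(char_emb_dim, char_hidden,
                                  batch_first=True, bidirectional=True)

    def forward(self, char_ids):            # char_ids: (num_words, max_word_len)
        emb = self.char_emb(char_ids)       # character embeddings
        _, (h_n, _) = self.char_blstm(emb)  # h_n: (2, num_words, char_hidden)
        # concatenate the last forward and backward hidden states as c_i
        return torch.cat([h_n[0], h_n[1]], dim=-1)   # (num_words, 2 * char_hidden)
```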
Step 1.3) For each word in the text, its index in the predefined vocabulary is first found by table lookup, and the corresponding vector is retrieved from the pre-trained word vector set using this index as the word embedding vector representation w_i. The character-level vector representation c_i encoded in step 1.2) is then concatenated with the word embedding vector representation w_i as the initial semantic representation of the word: e_i = [c_i; w_i].
For example, for the 4th word "Tianjin" in the word sequence, concatenating its character-level features and its word embedding vector gives the initial semantic representation e_4 = [0.04, -0.77, …, 0.31; 0.11, 0.89, …, -0.25].
Step 1.4) Based on the initial semantic representation sequence E = {e_1, e_2, …, e_N} obtained in step 1.3), BLSTM is used to encode the local context semantic representation h_i of each word x_i in the text.
For example, after the text is encoded by the BLSTM, the local context semantic representation corresponding to the 4th word "Tianjin" in the word sequence is h_4 = [0.02, 0.11, …, 0.76].
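Steps 1.3) and 1.4) can be sketched together as follows; the sentence length, embedding sizes and hidden size are assumed values chosen to match the nine-word example.

```python
# Sketch of steps 1.3-1.4: e_i = [c_i; w_i], then a word-level BLSTM yields h_i (assumed sizes).
import torch
import torch.nn as nn

word_emb = torch.randn(1, 9, 100)   # w_i for the 9 words of the example sentence
char_repr = torch.randn(1, 9, 50)   # c_i from the character-level BLSTM sketch above
e = torch.cat([word_emb, char_repr], dim=-1)          # initial semantics e_i = [c_i; w_i]

blstm = nn.LSTM(input_size=150, hidden_size=100,
                batch_first=True, bidirectional=True)
h, _ = blstm(e)                     # h: (1, 9, 200), local context representations h_i
h4 = h[0, 3]                        # local context semantics of the 4th word ("Tianjin")
```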
Step 2) global semantic coding: encoding a global semantic representation of the word using a multi-headed self-attention mechanism based on the local context semantic representation of the word encoded in step 1).
Step 2.1) To learn more diverse global semantic representations with the self-attention mechanism, this step uses a fully connected layer to map the local context semantic representation sequence H = {h_1, h_2, …, h_N} encoded in step 1) into M different feature subspaces. The mapping for the i-th feature subspace is:
Q_i = H W_i^Q, K_i = H W_i^K, V_i = H W_i^V
where W_i^Q, W_i^K and W_i^V are model parameters; Q represents the query in the attention mechanism, K represents the key, and V represents the value corresponding to the key.
For example, through the fully connected transformation of the context semantic representation sequence encoded in step 1), the query Q_i, the key K_i and the value V_i required by the attention mechanism in the i-th feature subspace are obtained.
Step 2.2) encoding the semantic representation of the word by using a self-attention mechanism based on dot product under different feature subspaces obtained in step 2.1):
head_i = Attention(Q_i, K_i, V_i) (8)
Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_k) V_i (9)
where d_k represents the feature dimension of the subspace and T represents the matrix transpose operation.
For example, in the i-th feature subspace, the semantic representation head_i encoded by the attention mechanism is obtained.
Step 2.3) The semantic representations head_i in each feature subspace calculated in step 2.2) are concatenated, and the concatenation result is fed into a fully connected layer to obtain the global semantic representation sequence Z corresponding to the words.
Z = [head_1; head_2; …; head_M] W_z (10)
where W_z is a model parameter.
For example, through concatenation and the fully connected layer, the global semantic representation sequence Z is obtained.
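A minimal sketch of this global semantic encoding (equations (8) to (10)) is given below; it follows the standard multi-head dot-product self-attention computation, with d_model and the number of heads chosen as assumptions.

```python
# Sketch of step 2: multi-head dot-product self-attention over H (assumed sizes).
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=200, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_k = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)   # stacks the per-head W_i^Q
        self.w_k = nn.Linear(d_model, d_model)   # stacks the per-head W_i^K
        self.w_v = nn.Linear(d_model, d_model)   # stacks the per-head W_i^V
        self.w_z = nn.Linear(d_model, d_model)   # W_z applied after concatenation

    def forward(self, H):                        # H: (batch, N, d_model)
        B, N, _ = H.shape
        def split(x):                            # -> (batch, heads, N, d_k)
            return x.view(B, N, self.num_heads, self.d_k).transpose(1, 2)
        Q, K, V = split(self.w_q(H)), split(self.w_k(H)), split(self.w_v(H))
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)   # Q_i K_i^T / sqrt(d_k)
        heads = torch.softmax(scores, dim=-1) @ V                # Attention(Q_i, K_i, V_i)
        Z = heads.transpose(1, 2).reshape(B, N, -1)              # [head_1; ...; head_M]
        return self.w_z(Z)                                       # global semantics Z
```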
Step 3) Semantic feature fusion: although the attention mechanism is not limited by distance when modeling semantic or syntactic dependencies and can compensate for the weakness of BLSTM in long-distance semantic modeling, it is an order-insensitive computing mechanism, so sequential context may be lost during modeling. Therefore, this step constructs three feature fusion modes to fuse the local semantic features H learned by the BLSTM in step 1) with the global semantic features Z learned by the multi-head self-attention mechanism in step 2), so that their advantages complement each other, and the fusion result U is used as the input semantic feature of step 4).
Step 3.1) One-dimensional parameter fusion method: for the i-th word in the text, the corresponding local context semantic representation h_i and global semantic representation z_i are first concatenated, mapped to a one-dimensional space by a fully connected layer, and passed through a sigmoid activation function to obtain the fusion weight β_i:
β_i = sigmoid(W_β[h_i; z_i]) (11)
Semantic representation after one-dimensional parameter fusion: u_i = (1 - β_i)·h_i + β_i·z_i (12)
where W_β is a model parameter.
For example, the local context semantic representation of the 4th word "Tianjin" in the word sequence is h_4 = [0.02, 0.11, …, 0.76] and its global semantic representation is z_4 = [0.14, 0.09, …, -0.26]. The calculation gives β_4 = 0.4, so the fused semantic representation is u_4 = [0.07, 0.10, …, 0.35].
Step 3.2) Multi-dimensional parameter fusion method: drawing on the gating mechanism of the LSTM, for the i-th word in the text, the corresponding local semantic representation h_i and global semantic representation z_i are first concatenated, mapped by a fully connected layer to a weight space with the same dimension as the semantic representation, and passed through a sigmoid activation function to obtain the fusion weight vector α_i:
α_i = sigmoid(W_α[h_i; z_i]) (13)
where W_α is a model parameter. The local semantics and the global semantics are then fused by element-wise multiplication:
Semantic representation after multi-dimensional parameter fusion: u_i = (I - α_i) ⊙ h_i + α_i ⊙ z_i (14)
where ⊙ denotes element-wise multiplication and I denotes a column vector whose elements are all 1.
For example, the local context semantic representation of the 4th word "Tianjin" in the word sequence is h_4 = [0.02, 0.11, …, 0.76] and its global semantic representation is z_4 = [0.14, 0.09, …, -0.26]. The calculation gives α_4 = [0.31, 0.1, …, 0.4], so the fused semantic representation is u_4 = [0.06, 0.11, …, 0.35].
Step 3.3) Free-weight semantic fusion method: specifically, two trainable parameters γ and δ are randomly initialized in this step, and semantic feature fusion is performed with them:
u_i = γ_i ⊙ h_i + δ_i ⊙ z_i (15)
For example, the local context semantic representation of the 4th word "Tianjin" in the word sequence is h_4 = [0.02, 0.11, …, 0.76] and its global semantic representation is z_4 = [0.14, 0.09, …, -0.26]. After model optimization, γ_4 = [0.19, 0.52, …, -0.11] and δ_4 = [-0.22, 0.98, …, 0.17], so the fused semantic representation is u_4 = [-0.03, 0.15, …, 0.13].
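The three fusion methods of equations (11) to (15) can be sketched in one module as follows; the feature dimension and the mode names are assumptions made for illustration.

```python
# Sketch of step 3: the three feature fusion modes (assumed dimension and mode names).
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    def __init__(self, d_model=200, mode="one_dim"):
        super().__init__()
        self.mode = mode
        self.w_beta = nn.Linear(2 * d_model, 1)          # W_beta (one-dimensional fusion)
        self.w_alpha = nn.Linear(2 * d_model, d_model)   # W_alpha (multi-dimensional fusion)
        self.gamma = nn.Parameter(torch.ones(d_model))   # free-weight fusion parameters
        self.delta = nn.Parameter(torch.ones(d_model))

    def forward(self, h, z):                             # both (batch, N, d_model)
        if self.mode == "one_dim":                       # u = (1 - beta)*h + beta*z
            beta = torch.sigmoid(self.w_beta(torch.cat([h, z], dim=-1)))
            return (1 - beta) * h + beta * z
        if self.mode == "multi_dim":                     # u = (I - alpha) ⊙ h + alpha ⊙ z
            alpha = torch.sigmoid(self.w_alpha(torch.cat([h, z], dim=-1)))
            return (1 - alpha) * h + alpha * z
        return self.gamma * h + self.delta * z           # u = gamma ⊙ h + delta ⊙ z
```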
Step 4), sequence labeling: in order to fully consider the dependency relationship among the labels in the sequence labeling task, the CRF is utilized to predict the labels.
Step 4.1) The fused semantic feature sequence U = {u_1, u_2, …, u_N} obtained in step 3) (in practical applications, one fusion mode is selected from step 3) and its result is used as the input of step 4)) is passed through a fully connected transformation to obtain the state feature matrix P, which represents the association between the semantics of each word and the labels:
P = U W_p + b_p (16)
where W_p and b_p are model parameters.
For example, for the 4th word "Tianjin" in the word sequence, its state feature may be p_4 = [0.01, 0.91, …, 0.00].
Step 4.2) A transfer feature matrix A is randomly initialized to represent the transfer relationships between labels; this matrix is optimized through loss backpropagation during model training.
Step 4.3) Based on the state feature matrix P obtained in step 4.1) and the transfer feature matrix A obtained in step 4.2), the score corresponding to any possible label sequence Y' = {y'_1, y'_2, …, y'_N} is calculated as:
s(X, Y') = Σ_{i=1..N} (A_{y'_{i-1}, y'_i} + P_{i, y'_i})
Based on the score, the probability corresponding to the label sequence is calculated:
P(Y' | X) = exp(s(X, Y')) / Σ_{Y''} exp(s(X, Y''))
For example, for the named entity recognition task, the probability corresponding to the standard tag sequence of the word sequence in the above example is 0.9.
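For illustration, the score of a single tag sequence can be computed from P and A as sketched below; the extra start-tag row of A is an assumption added to handle the transition into the first position.

```python
# Sketch of the sequence score s(X, Y') = sum_i (A[y'_{i-1}, y'_i] + P[i, y'_i]).
# A is assumed to have one extra row indexed by start_tag for the first transition.
import torch

def sequence_score(P, A, tags, start_tag):
    # P: (N, num_tags) state features; A: (num_tags + 1, num_tags) transfer features
    # tags: list of N tag indices (a gold or candidate label sequence)
    score = torch.tensor(0.0)
    prev = start_tag
    for i, t in enumerate(tags):
        score = score + A[prev, t] + P[i, t]   # transfer score + state score
        prev = t
    return score
```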
Step 5) Model training: in the model training process, the invention maximizes the probability P(Y | X) corresponding to the standard label sequence. Therefore, the parameters in steps 1) to 4) are optimized by minimizing the following negative log-likelihood function:
L = -log P(Y | X)
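The negative log-likelihood can be sketched with the standard forward algorithm in log space, reusing the sequence_score sketch above; this is a generic CRF loss outline under the same shape assumptions, not the invention's exact training code.

```python
# Sketch of step 5: -log P(Y|X) = log Z(X) - s(X, Y), with log Z(X) from the forward algorithm.
import torch

def crf_negative_log_likelihood(P, A, gold_tags, start_tag):
    num_tags = P.size(1)
    alpha = A[start_tag] + P[0]                # scores of length-1 prefixes ending in each tag
    for i in range(1, P.size(0)):
        # alpha_new[j] = logsumexp_k(alpha[k] + A[k, j]) + P[i, j]
        alpha = torch.logsumexp(alpha.unsqueeze(1) + A[:num_tags], dim=0) + P[i]
    log_Z = torch.logsumexp(alpha, dim=0)      # log of the sum over all label sequences
    return log_Z - sequence_score(P, A, gold_tags, start_tag)
```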
step 6) model reasoning: in the practical application process, the invention adopts the Viterbi algorithm to search the optimal label sequence:
the foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.
Claims (10)
1. A sequence labeling method based on a multi-head self-attention mechanism is characterized by comprising the following steps of sequentially executing:
step 1, local context semantic coding, namely learning local context semantic representation of words in a text in a BLSTM serialization manner:
step 1.1, performing word segmentation on an input text to obtain a corresponding word sequence;
step 1.2, for each word in the word sequence, coding character-level vector representation corresponding to each word by using a BLSTM structure;
step 1.3, for each word in the word sequence, splicing the character-level vector representation and the word embedding vector representation coded in the step 1.2 to serve as word initial semantic representation;
step 1.4, based on the word initial semantic representation obtained in step 1.3, using BLSTM to encode the local context semantic representation of each word;
step 2, global semantic coding, namely coding the global semantic representation of the words by utilizing a multi-head self-attention mechanism based on the local context semantic representation of the words coded in the step 1:
step 2.1, mapping the local context semantic representation of the words coded in step 1 to a plurality of different feature subspaces by adopting a full connection layer;
step 2.2, under different feature subspaces obtained in step 2.1, utilizing a self-attention mechanism to encode semantic representation of words;
step 2.3, the semantic representations of the words in each feature subspace calculated in the step 2.2 are spliced, and the splicing result is input into a full connection layer to obtain the global semantic representation corresponding to each word;
and 3, semantic feature fusion, namely constructing the following three feature fusion modes, fusing the local context semantic representation coded in the step 1 and the global semantic representation coded in the step 2, and taking a fusion result as an input semantic feature in the step 4:
step 3.1, constructing a one-dimensional parameter fusion method to realize the linear combination of local context semantics and global semantics;
step 3.2, building a multi-dimensional parameter fusion method by using a gating mechanism adopted in the LSTM for reference;
3.3, constructing a free weight semantic fusion method;
and 4, carrying out sequence labeling, namely predicting the labels by using a CRF (conditional random field) in order to fully consider the dependency relationships among the labels in the sequence labeling task:
step 4.1, performing full-connection transformation on the fused semantic feature sequence obtained in the step 3 to obtain a state feature matrix, and representing the association between the semantics of each word and the label;
step 4.2, initializing a transfer characteristic matrix randomly to express the transfer relation between the labels;
step 4.3, calculating the corresponding score and probability of any possible label sequence based on the state characteristic matrix obtained in the step 4.1 and the transfer characteristic matrix obtained in the step 4.2;
step 5, model training: in the model training process, optimizing the parameters in the steps 1 to 4 by adopting the probability corresponding to the maximized standard label sequence;
step 6, model reasoning: in the practical application process, the optimal label sequence is searched by adopting a Viterbi algorithm, and model reasoning is carried out.
2. The method for labeling sequences based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 1.1, the Stanford NLP toolkit is used to perform word segmentation on the input text.
4. The method for labeling sequences based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 1.4, based on the word initial semantic representation sequence E = {e_1, e_2, …, e_N} obtained in step 1.3, BLSTM is used to encode the local context semantic representation h_i of each word x_i in the text.
5. The method for labeling sequences based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 2.1, the word local context semantic representation sequence H = {h_1, h_2, …, h_N} encoded in step 1 is mapped to M different feature subspaces, where the mapping for the i-th feature subspace is:
Q_i = H W_i^Q, K_i = H W_i^K, V_i = H W_i^V
where W_i^Q, W_i^K and W_i^V are model parameters.
6. The method for labeling sequences based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 2.2, the semantic representation of the words is encoded by a dot-product self-attention mechanism in the different feature subspaces obtained in step 2.1:
head_i = Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_k) V_i
where d_k represents the feature dimension of the subspace and T represents the matrix transpose operation.
7. The method for labeling sequences based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 2.3, the semantic representations head_i in each feature subspace calculated in step 2.2 are concatenated, and the concatenation result is fed into a fully connected layer to obtain the global semantic representation sequence Z corresponding to the words:
Z = [head_1; head_2; …; head_M] W_z
where W_z is a model parameter.
8. The method for labeling sequences based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 3, the semantic representation after one-dimensional parameter fusion is: u_i = (1 - β_i)·h_i + β_i·z_i,
where β_i = sigmoid(W_β[h_i; z_i]), h_i is the local context semantic representation, z_i is the global semantic representation, and W_β is a model parameter;
the semantic representation after multi-dimensional parameter fusion is: u_i = (I - α_i) ⊙ h_i + α_i ⊙ z_i,
where α_i = sigmoid(W_α[h_i; z_i]) and W_α is a model parameter;
the semantic representation after free-weight semantic fusion is: u_i = γ_i ⊙ h_i + δ_i ⊙ z_i,
where γ and δ are two trainable parameters.
9. The method for labeling sequences based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 4.1, the fused semantic feature sequence U = {u_1, u_2, …, u_N} obtained in step 3 is passed through a fully connected transformation to obtain the state feature matrix P, which represents the association between the semantics of each word and the labels:
P = U W_p + b_p
where W_p and b_p are model parameters.
10. The method for labeling sequences based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 4.3, based on the state feature matrix P obtained in step 4.1 and the transition feature matrix A obtained in step 4.2, the score corresponding to any possible label sequence Y' = {y'_1, y'_2, …, y'_N} is calculated as:
s(X, Y') = Σ_{i=1..N} (A_{y'_{i-1}, y'_i} + P_{i, y'_i})
Based on the score, the probability corresponding to the label sequence is calculated:
P(Y' | X) = exp(s(X, Y')) / Σ_{Y''} exp(s(X, Y''))
In the model training process, the probability P(Y | X) corresponding to the standard label sequence is maximized, and the parameters in steps 1 to 4 are optimized by minimizing the following negative log-likelihood function:
L = -log P(Y | X)
In step 6, the Viterbi algorithm is used to search for the optimal label sequence:
Y* = argmax_{Y'} s(X, Y')
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011187198.0A CN112380863A (en) | 2020-10-29 | 2020-10-29 | Sequence labeling method based on multi-head self-attention mechanism |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011187198.0A CN112380863A (en) | 2020-10-29 | 2020-10-29 | Sequence labeling method based on multi-head self-attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112380863A true CN112380863A (en) | 2021-02-19 |
Family
ID=74576393
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011187198.0A Pending CN112380863A (en) | 2020-10-29 | 2020-10-29 | Sequence labeling method based on multi-head self-attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112380863A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109697285A (en) * | 2018-12-13 | 2019-04-30 | 中南大学 | Enhance the hierarchical B iLSTM Chinese electronic health record disease code mask method of semantic expressiveness |
CN110457480A (en) * | 2019-08-16 | 2019-11-15 | 国网天津市电力公司 | The construction method of fine granularity sentiment classification model based on interactive attention mechanism |
CN111274398A (en) * | 2020-01-20 | 2020-06-12 | 福州大学 | Method and system for analyzing comment emotion of aspect-level user product |
CN111767409A (en) * | 2020-06-14 | 2020-10-13 | 南开大学 | Entity relationship extraction method based on multi-head self-attention mechanism |
CN111783394A (en) * | 2020-08-11 | 2020-10-16 | 深圳市北科瑞声科技股份有限公司 | Training method of event extraction model, event extraction method, system and equipment |
Non-Patent Citations (2)
Title |
---|
Zhang Zhichang et al., "Health Question Classification Fusing Local Semantic and Global Structural Information", Journal of Xidian University *
Wang Xuqiang et al., "Feature Fusion Sequence Labeling Model Based on Attention Mechanism", HTTPS://KNS.CNKI.NET/KCMS/DETAIL/37.1357.N.20200619.1603.002.HTML *
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113010685B (en) * | 2021-02-23 | 2022-12-06 | 安徽讯飞医疗股份有限公司 | Medical term standardization method, electronic device, and storage medium |
CN113010685A (en) * | 2021-02-23 | 2021-06-22 | 安徽科大讯飞医疗信息技术有限公司 | Medical term standardization method, electronic device, and storage medium |
CN112990434A (en) * | 2021-03-09 | 2021-06-18 | 平安科技(深圳)有限公司 | Training method of machine translation model and related device |
CN112990434B (en) * | 2021-03-09 | 2023-06-20 | 平安科技(深圳)有限公司 | Training method of machine translation model and related device |
CN112967112B (en) * | 2021-03-24 | 2022-04-29 | 武汉大学 | Electronic commerce recommendation method for self-attention mechanism and graph neural network |
CN112967112A (en) * | 2021-03-24 | 2021-06-15 | 武汉大学 | Electronic commerce recommendation method for self-attention mechanism and graph neural network |
CN113158051B (en) * | 2021-04-23 | 2022-11-18 | 山东大学 | Label sorting method based on information propagation and multilayer context information modeling |
CN113158051A (en) * | 2021-04-23 | 2021-07-23 | 山东大学 | Label sorting method based on information propagation and multilayer context information modeling |
CN113240098A (en) * | 2021-06-16 | 2021-08-10 | 湖北工业大学 | Fault prediction method and device based on hybrid gated neural network and storage medium |
CN113378243A (en) * | 2021-07-14 | 2021-09-10 | 南京信息工程大学 | Personalized federal learning method based on multi-head attention mechanism |
CN113378243B (en) * | 2021-07-14 | 2023-09-29 | 南京信息工程大学 | Personalized federal learning method based on multi-head attention mechanism |
CN114462406A (en) * | 2022-03-01 | 2022-05-10 | 中国航空综合技术研究所 | Method for acquiring first-appearing aviation keywords based on multi-head self-attention model |
CN115796173A (en) * | 2023-02-20 | 2023-03-14 | 杭银消费金融股份有限公司 | Data processing method and system for supervision submission requirements |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112380863A (en) | Sequence labeling method based on multi-head self-attention mechanism | |
CN109992782B (en) | Legal document named entity identification method and device and computer equipment | |
CN108416058B (en) | Bi-LSTM input information enhancement-based relation extraction method | |
CN106502985B (en) | neural network modeling method and device for generating titles | |
CN111666427B (en) | Entity relationship joint extraction method, device, equipment and medium | |
CN109062901B (en) | Neural network training method and device and name entity recognition method and device | |
CN110232192A (en) | Electric power term names entity recognition method and device | |
CN112541356B (en) | Method and system for recognizing biomedical named entities | |
CN111767409A (en) | Entity relationship extraction method based on multi-head self-attention mechanism | |
CN113987169A (en) | Text abstract generation method, device and equipment based on semantic block and storage medium | |
CN111881256B (en) | Text entity relation extraction method and device and computer readable storage medium equipment | |
CN111368542A (en) | Text language association extraction method and system based on recurrent neural network | |
CN114298053A (en) | Event joint extraction system based on feature and attention mechanism fusion | |
CN113326367B (en) | Task type dialogue method and system based on end-to-end text generation | |
CN110874536A (en) | Corpus quality evaluation model generation method and bilingual sentence pair inter-translation quality evaluation method | |
CN111737497B (en) | Weak supervision relation extraction method based on multi-source semantic representation fusion | |
CN116932722A (en) | Cross-modal data fusion-based medical visual question-answering method and system | |
CN112163089A (en) | Military high-technology text classification method and system fusing named entity recognition | |
Xu et al. | Match-prompt: Improving multi-task generalization ability for neural text matching via prompt learning | |
CN115169349A (en) | Chinese electronic resume named entity recognition method based on ALBERT | |
CN116341564A (en) | Problem reasoning method and device based on semantic understanding | |
CN117610562B (en) | Relation extraction method combining combined category grammar and multi-task learning | |
CN113076718B (en) | Commodity attribute extraction method and system | |
CN115019142A (en) | Image title generation method and system based on fusion features and electronic equipment | |
CN116680575B (en) | Model processing method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20210219 |
RJ01 | Rejection of invention patent application after publication | |