CN112380863A - Sequence labeling method based on multi-head self-attention mechanism - Google Patents

Sequence labeling method based on multi-head self-attention mechanism Download PDF

Info

Publication number
CN112380863A
CN112380863A (application CN202011187198.0A)
Authority
CN
China
Prior art keywords
word
sequence
semantic
semantic representation
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011187198.0A
Other languages
Chinese (zh)
Inventor
孟洁
李妍
刘晨
张倩宜
王梓蒴
单晓怡
李慕轩
王林
刘赫
董雅茹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Tianjin Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Tianjin Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Tianjin Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN202011187198.0A priority Critical patent/CN112380863A/en
Publication of CN112380863A publication Critical patent/CN112380863A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a sequence labeling method based on a multi-head self-attention mechanism, which comprises the following steps: step 1, local context semantic coding, namely using a BLSTM (bidirectional long short-term memory network) to sequentially learn the local context semantic representation of the words in a text; step 2, global semantic coding, namely using a multi-head self-attention mechanism to encode the global semantic representation of the words based on the local context semantic representation encoded in step 1; step 3, semantic feature fusion, namely fusing the local context semantic representation encoded in step 1 with the global semantic representation encoded in step 2 and taking the fusion result as the input semantic feature of step 4; step 4, sequence labeling, namely predicting the labels with a CRF (conditional random field) so as to fully consider the dependency relationships among labels in the sequence labeling task; step 5, model training; and step 6, model inference. The invention further introduces a multi-head self-attention mechanism on top of the recurrent neural network to learn the global semantic representation of the words, which improves the sequence labeling effect.

Description

Sequence labeling method based on multi-head self-attention mechanism
Technical Field
The invention relates to the technical field of computer applications, in particular to a sequence labeling method based on a multi-head self-attention mechanism.
Background
Sequence labeling is an important research topic in natural language processing. It aims to predict a corresponding tag sequence for a given text sequence, and mainly includes tasks such as Named Entity Recognition (NER), Text Chunking, Part-Of-Speech (POS) tagging, and Opinion Extraction.
Early sequence labeling methods were mostly rule-based: rule templates and a large amount of expert knowledge had to be established, which consumes substantial manpower and material resources, and such methods are difficult to extend and port to other fields. For example, Wang et al. manually established a knowledge base for financial company name recognition in a rule-based manner. Toral and Mu automatically built and maintained gazetteers (lists of person names, organizations, places and other entities) for entity recognition based on analysis of online Wikipedia. Zizhenning et al. constructed and customized a named entity recognition tagger which, although field-adaptive and achieving good experimental results, was still manually built and time-consuming.
Due to the shortcomings of rule-based methods, machine learning models based on statistical learning have been increasingly applied to sequence labeling, such as Support Vector Machines (SVM), Hidden Markov Models (HMM), Conditional Random Fields (CRF) and Maximum Entropy (ME) models. For example, Mayfield et al. used an SVM to capture hundreds of features from the training data. Zhou and Su proposed an HMM-based named entity recognition system that can apply and fuse simple word features (such as capitalization and digits). McCallum and Li applied CRFs to named entity recognition and performed well on multiple datasets. Liu Yanchao et al. applied the ME model to named entity recognition and incorporated a method that fuses local and global features within sentences. Although methods based on statistical learning models achieve better performance, they still depend heavily on hand-crafted features and can only capture local features.
In recent years, with the rapid development of deep learning, its strong representation learning and automatic feature extraction capabilities have led to success in natural language processing, and deep learning is therefore widely used in sequence labeling tasks. For example, Zhang Miao et al. applied the BLSTM-CRF framework to sequence labeling and achieved highly competitive performance, because the BLSTM effectively exploits context features while the CRF models sentence-level tag information. Chiu proposed the BLSTM-CNN model, which obtains character features through a CNN, concatenates them with word embeddings and feeds them into a BLSTM; although effective, it relies on dictionary or lexicon features. Recently, the attention mechanism has been increasingly applied to sequence labeling tasks. Compared with LSTM or CNN, the attention mechanism is not constrained by distance when modeling semantic dependencies. For example, Rei et al. combined an attention mechanism with the BLSTM-CRF framework to learn weight coefficients and fed the weighted sum of the two kinds of features into the CRF for label prediction. Luo et al. showed that introducing the attention mechanism into BLSTM-CRF improves chemical and drug entity recognition, improves labeling consistency at the document level and enriches context information at the sentence level. Tan et al. proposed a deep attention network for sequence labeling, using an N-layer deep model in which each layer contains a non-linear layer and a self-attention layer, and taking the output of the top layer as the input of the softmax layer. Although existing deep learning based methods achieve better performance, they still suffer from local dependency, inaccurate position information and other shortcomings.
In summary, most existing sequence labeling methods are built on the LSTM-CRF framework, but using an LSTM as the encoder to learn the contextual semantic representation of words usually has two problems. First, sequence labeling models based on recurrent neural networks tend to exhibit local dependency, losing semantic information over long distances; the longer the distance between two words, the more pronounced this problem becomes. Second, such models are limited to serialized feature learning and therefore cannot flexibly model the semantic relationship between arbitrary pairs of words in the text.
Disclosure of Invention
The invention aims to provide a sequence labeling method based on a multi-head self-attention mechanism, addressing the problems of local dependency and serialized coding in prior-art sequence labeling methods.
the technical scheme adopted for realizing the purpose of the invention is as follows:
a sequence labeling method based on a multi-head self-attention mechanism comprises the following steps executed in sequence:
step 1, local context semantic coding, namely using a BLSTM to sequentially learn the local context semantic representation of the words in a text:
step 1.1, performing word segmentation on an input text to obtain a corresponding word sequence;
step 1.2, for each word in the word sequence, coding character-level vector representation corresponding to each word by using a BLSTM structure;
step 1.3, for each word in the word sequence, splicing the character-level vector representation and the word embedding vector representation coded in the step 1.2 to serve as word initial semantic representation;
step 1.4, based on the word initial semantic representation obtained in step 1.3, using BLSTM to encode the local context semantic representation of each word;
step 2, global semantic coding, namely coding the global semantic representation of the words by utilizing a multi-head self-attention mechanism based on the local context semantic representation of the words coded in the step 1:
step 2.1, mapping the local context semantic representation of the words coded in step 1 to a plurality of different feature subspaces by adopting a full connection layer;
step 2.2, under different feature subspaces obtained in step 2.1, utilizing a self-attention mechanism to encode semantic representation of words;
step 2.3, the semantic representations of the words in each feature subspace calculated in the step 2.2 are spliced, and the splicing result is input into a full connection layer to obtain the global semantic representation corresponding to each word;
step 3, semantic feature fusion, namely constructing the following three feature fusion modes, fusing the local context semantic representation encoded in step 1 with the global semantic representation encoded in step 2, and taking the fusion result as the input semantic feature of step 4:
step 3.1, constructing a one-dimensional parameter fusion method to realize the linear combination of local context semantics and global semantics;
step 3.2, constructing a multi-dimensional parameter fusion method that draws on the gating mechanism used in the LSTM;
step 3.3, constructing a free-weight semantic fusion method;
step 4, sequence labeling, namely predicting the labels with a CRF (conditional random field) in order to fully consider the dependency relationships among the labels in the sequence labeling task:
step 4.1, performing full-connection transformation on the fused semantic feature sequence obtained in the step 3 to obtain a state feature matrix, and representing the association between the semantics of each word and the label;
step 4.2, initializing a transfer characteristic matrix randomly to express the transfer relation between the labels;
step 4.3, calculating the corresponding score and probability of any possible label sequence based on the state characteristic matrix obtained in the step 4.1 and the transfer characteristic matrix obtained in the step 4.2;
step 5, model training: in the model training process, optimizing the parameters in the steps 1 to 4 by adopting the probability corresponding to the maximized standard label sequence;
step 6, model reasoning: in the practical application process, the optimal label sequence is searched by adopting a Viterbi algorithm, and model reasoning is carried out.
In the above technical solution, in the step 1.1, a Stanford NLP toolkit is used to perform word segmentation on an input text.
In the above technical solution, in step 1.3, the initial semantic representation of the word is e_i = [c_i; w_i], where c_i is the character-level vector representation and w_i is the word embedding vector representation.
In the above technical solution, in step 1.4, based on the word initial semantic representation sequence E = {e_1, e_2, ..., e_N} obtained in step 1.3, a BLSTM is used to encode the local context semantic representation h_i of each word x_i in the text:
h_i^fw = LSTM_fw(e_i, h_{i-1}^fw)
h_i^bw = LSTM_bw(e_i, h_{i+1}^bw)
h_i = [h_i^fw; h_i^bw]
In the above technical solution, in step 2.1, the word local context semantic representation sequence H = {h_1, h_2, ..., h_N} encoded in step 1 is mapped to M different feature subspaces, and the mapping for the i-th feature subspace is:
Q_i = H W_i^Q
K_i = H W_i^K
V_i = H W_i^V
where W_i^Q, W_i^K and W_i^V are model parameters; Q denotes the query in the attention mechanism, K the keys, and V the values corresponding to the keys.
In the above technical solution, in step 2.2, in the different feature subspaces obtained in step 2.1, a dot-product-based self-attention mechanism is used to encode the semantic representation of the words:
head_i = Attention(Q_i, K_i, V_i)
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
where d_k denotes the feature dimension of the subspace and T denotes the matrix transpose.
In the above technical solution, in step 2.3, the semantic representations head_i in the feature subspaces calculated in step 2.2 are concatenated, and the concatenation result is fed into a fully connected layer to obtain the global semantic representation sequence Z corresponding to the words:
Z = [head_1; head_2; ...; head_M] W_z
where W_z is a model parameter.
In the above technical solution, in step 3, the semantic representation after one-dimensional parameter fusion is:
u_i = (1 - β_i) · h_i + β_i · z_i
where β_i = sigmoid(W_β [h_i; z_i]), h_i is the local context semantic representation, z_i is the global semantic representation, and W_β is a model parameter;
the semantic representation after multi-dimensional parameter fusion is:
u_i = (I - α_i) ⊙ h_i + α_i ⊙ z_i
where α_i = sigmoid(W_α [h_i; z_i]) and W_α is a model parameter;
the semantic representation of the free-weight semantic fusion is:
u_i = γ_i ⊙ h_i + δ_i ⊙ z_i
where γ and δ are two trainable parameters.
In the above technical solution, in step 4.1, the fused semantic feature sequence U = {u_1, u_2, ..., u_N} obtained in step 3 is transformed by a fully connected layer to obtain the state feature matrix P, which represents the association between the semantics of each word and the labels:
P = U W_p + b_p
where W_p and b_p are model parameters.
In the above technical solution, in step 4.3, based on the state feature matrix obtained in step 4.1 and the transfer feature matrix A obtained in step 4.2, the score of any possible label sequence Ŷ = {ŷ_1, ŷ_2, ..., ŷ_N} is calculated as:
score(X, Ŷ) = Σ_i A_{ŷ_i, ŷ_{i+1}} + Σ_i P_{i, ŷ_i}
and, based on the score, the probability corresponding to the label sequence is:
P(Ŷ | X) = exp(score(X, Ŷ)) / Σ_{Y′} exp(score(X, Y′))
In the model training process, the probability P(Y | X) corresponding to the standard label sequence is maximized, and the parameters in steps 1 to 4 are optimized by minimizing the following negative log-likelihood function:
L = -log P(Y | X) = -score(X, Y) + log Σ_{Y′} exp(score(X, Y′))
In step 6, the optimal label sequence is searched with the Viterbi algorithm:
Y* = argmax_{Y′} score(X, Y′)
compared with the prior art, the invention has the beneficial effects that:
1. The invention further introduces a multi-head self-attention mechanism on top of the recurrent neural network to learn the global semantic representation of the words, which improves the sequence labeling effect and effectively alleviates the local dependency and serialized coding problems caused by encoding with a recurrent neural network alone.
2. The local context semantics encoded by the recurrent neural network comprehensively considers the short-distance semantics of words and the word-order relations between them, while the global semantics encoded by the multi-head self-attention mechanism is not limited by distance when modeling semantics, making up for the weakness of the recurrent neural network in long-distance semantic modeling, but lacking word-order modeling. The local semantics and the global semantics are therefore complementary to a certain extent. The invention considers both, constructs three semantic feature fusion methods, and fuses the local semantic features learned by the BLSTM with the global semantic features learned by the multi-head self-attention mechanism to achieve complementary advantages; the fusion result is used as the input semantic feature, which improves the sequence labeling effect.
Drawings
FIG. 1 is a schematic diagram of the overall structure of the present invention.
FIG. 2 is a schematic diagram of a sequence labeling method based on a multi-head self-attention mechanism.
Detailed Description
The present invention will be described in further detail with reference to specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example 1
The invention first learns the contextual semantic features of the words in a text with a bidirectional long short-term memory network (BLSTM). Then, based on the hidden representations learned by the BLSTM, a multi-head self-attention mechanism is used to model the semantic relationship between any two words in the text, thereby obtaining the global semantics that each word should attend to. In order to fully exploit the complementarity of the local context semantics and the global semantics, the invention designs three feature fusion modes to fuse these two kinds of semantics, and uses a conditional random field (CRF) model to predict the label sequence based on the fused features.
Example 2
The invention mainly uses deep learning techniques and theories related to natural language processing to implement the sequence labeling task. To ensure normal operation of the system, the computer platform used in the specific implementation should have at least 8 GB of memory, a CPU with at least 4 cores and a base frequency of at least 2.6 GHz, a GPU environment, a Linux operating system, and the necessary software environment, such as Python 3.6 or above and PyTorch 0.4 or above.
As shown in fig. 1, the sequence labeling method based on the multi-head self-attention mechanism provided by the present invention mainly comprises the following steps executed in sequence:
step 1, local context semantic coding: the local context semantic representation of words in text is learned sequentially using a bidirectional long-short term memory network (BLSTM).
Step 1.1) the Stanford NLP toolkit is adopted to perform word segmentation on the input text to obtain a corresponding word sequence.
Step 1.2) for each word in the word sequence, encoding character-level vector representation corresponding to each word by using a Bidirectional LSTM (BLSTM) structure.
Step 1.3) for each word in the text, splicing the character-level vector representation encoded in step 1.2) with the word-embedded vector representation as the initial semantic representation of the word.
Step 1.4) using a BLSTM to encode the local context semantic representation of each word in the text: the input is the initial semantic representation of the words obtained in step 1.3), and the output is the local context semantic representation of each word.
Step 2, global semantic coding: encoding a global semantic representation of the word using a multi-headed self-attention mechanism based on the local context semantic representation of the word encoded in step 1).
Step 2.1) mapping the local context semantic representation of the word encoded in step 1 to a plurality of different feature subspaces using a full connectivity layer.
Step 2.2) utilizing a self-attention mechanism to encode semantic representation of the words under different feature subspaces obtained in the step 2.1).
And 2.3) splicing the semantic representations under each feature subspace calculated in the step 2.2), and inputting the splicing result into a full connection layer to obtain the global semantic representation corresponding to each word.
Step 3, semantic feature fusion: and constructing the following three feature fusion modes, fusing the local semantic representation coded in the step 1) and the global semantic representation coded in the step 2), and taking the fusion result as the input semantic feature of the step 4.
And 3.1) constructing a one-dimensional parameter fusion method to realize the linear combination of local semantics and global semantics.
Step 3.2) constructing a multi-dimensional parameter fusion method that draws on the gating mechanism used in the LSTM.
And 3.3) constructing a free weight semantic fusion method.
And 4, sequence labeling: in order to fully consider the dependency relationship among the labels in the sequence labeling task, the CRF is utilized to predict the labels.
And 4.1) carrying out full-connection transformation on the semantic feature sequence obtained in the step 3 after fusion to obtain a state feature matrix, and representing the association between the semantics of each word and the label.
And 4.2) randomly initializing a transfer characteristic matrix to represent the transfer relation between the labels.
And 4.3) calculating the corresponding score and probability of any possible label sequence based on the state characteristic matrix obtained in the step 4.1) and the transfer characteristic matrix obtained in the step 4.2).
Step 5, model training: in the model training process, the parameters in the steps 1 to 4 are optimized by adopting the probability corresponding to the maximized standard label sequence.
Step 6, model reasoning: in the practical application process, the optimal label sequence is searched by adopting a Viterbi algorithm, and model reasoning is carried out.
Example 3
The sequence labeling method based on the multi-head self-attention mechanism mainly comprises the following steps executed in sequence:
step 1, local context semantic coding: the local context semantic representation of words in text is learned sequentially using a bidirectional long-short term memory network (BLSTM).
Step 1.1, the Stanford NLP toolkit is used to segment the input text into the corresponding word sequence X = {x_1, x_2, ..., x_N}.
For example, given the text "I took part in a marathon race in Tianjin yesterday", the word sequence {"I", "yesterday", "in", "Tianjin", "participating in", "having", "a", "marathon", "race"} is obtained after word segmentation.
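As an illustration of step 1.1, the sketch below uses the stanza package (a Python library from the Stanford NLP Group) to segment a Chinese sentence; the patent only names the Stanford NLP toolkit, so the specific package, pipeline options and expected output here are assumptions, not part of the original disclosure.

import stanza

# one-time model download, then a tokenization-only pipeline for Chinese
stanza.download("zh", processors="tokenize")
nlp = stanza.Pipeline("zh", processors="tokenize")

text = "我昨天在天津参加了一场马拉松比赛"   # "I took part in a marathon race in Tianjin yesterday"
doc = nlp(text)
words = [word.text for sent in doc.sentences for word in sent.words]
print(words)   # expected to resemble ["我", "昨天", "在", "天津", "参加", "了", "一场", "马拉松", "比赛"]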
Step 1.2, considering that the words in the text usually contain rich morphological features, such as prefix and suffix information, for each word x_i = {c_{i,1}, c_{i,2}, ...} in the word sequence, a bidirectional LSTM (BLSTM) structure is used to encode the character-level vector representation c_i corresponding to the word, where c_{i,j} denotes the j-th character of the i-th word in the text.
For example, for the 4th word "Tianjin" in the word sequence, its 1st character is "Tian" and its 2nd character is "Jin". Through BLSTM encoding, the character-level vector representation of "Tianjin", e.g. c_4 = [0.04, -0.77, ..., 0.31], can be obtained.
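A minimal PyTorch sketch of the character-level encoding in step 1.2 follows; the class name, vocabulary handling and embedding/hidden sizes are illustrative assumptions, not values fixed by the patent.

import torch
import torch.nn as nn

class CharBLSTM(nn.Module):
    """Encode a word's character sequence into a fixed-size vector c_i (step 1.2, sketch)."""
    def __init__(self, n_chars, char_dim=30, hidden_dim=25):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.blstm = nn.LSTM(char_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, char_ids):              # char_ids: (n_words, max_word_len)
        emb = self.char_emb(char_ids)         # (n_words, max_word_len, char_dim)
        _, (h_n, _) = self.blstm(emb)         # h_n: (2, n_words, hidden_dim)
        # concatenate the final forward and backward hidden states as the
        # character-level representation c_i of each word
        return torch.cat([h_n[0], h_n[1]], dim=-1)   # (n_words, 2*hidden_dim)

# e.g. the word "Tianjin" -> character ids for "Tian" and "Jin" -> one vector c_4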
Step 1.3, for each word in the text, its index in the predefined vocabulary is first found by table lookup, and this index is used to retrieve the corresponding vector from the set of pre-trained word vectors as the word embedding vector representation w_i of the word. The character-level vector representation c_i encoded in step 1.2 is then concatenated with the word embedding vector representation w_i as the initial semantic representation of the word:
e_i = [c_i; w_i]   (1)
For example, for the 4th word "Tianjin" in the word sequence, the corresponding word embedding vector representation is w_4 = [0.11, 0.89, ..., -0.25]. Concatenating the character-level features with the word embedding vector gives the initial semantic representation of "Tianjin": e_4 = [0.04, -0.77, ..., 0.31; 0.11, 0.89, ..., -0.25].
Step 1.4, based on the initial semantic representation sequence E = {e_1, e_2, ..., e_N} of the words obtained in step 1.3, a BLSTM is used to encode the local context semantic representation h_i of each word x_i in the text:
h_i^fw = LSTM_fw(e_i, h_{i-1}^fw)   (2)
h_i^bw = LSTM_bw(e_i, h_{i+1}^bw)   (3)
h_i = [h_i^fw; h_i^bw]   (4)
For example, after the text is encoded by the BLSTM, the local context semantic representation corresponding to the 4th word "Tianjin" in the word sequence is h_4 = [0.02, 0.11, ..., 0.76].
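The sketch below illustrates steps 1.3–1.4: concatenating the character-level vector with a pre-trained word embedding to form e_i and running a word-level BLSTM over the sequence to obtain h_i. Class names, tensor shapes and dimensions are assumptions made for illustration.

import torch
import torch.nn as nn

class LocalContextEncoder(nn.Module):
    """Steps 1.3-1.4 (sketch): e_i = [c_i; w_i], then BLSTM -> h_i."""
    def __init__(self, word_vectors, char_feat_dim=50, hidden_dim=100):
        super().__init__()
        # word_vectors: pre-trained embedding matrix of shape (vocab_size, word_dim)
        self.word_emb = nn.Embedding.from_pretrained(word_vectors, freeze=False)
        word_dim = word_vectors.size(1)
        self.blstm = nn.LSTM(word_dim + char_feat_dim, hidden_dim,
                             batch_first=True, bidirectional=True)

    def forward(self, word_ids, char_feats):
        # word_ids: (batch, N); char_feats: (batch, N, char_feat_dim) from a char-level encoder
        w = self.word_emb(word_ids)                    # (batch, N, word_dim)
        e = torch.cat([char_feats, w], dim=-1)         # initial semantic representation e_i
        h, _ = self.blstm(e)                           # local context representation h_i
        return h                                       # (batch, N, 2*hidden_dim)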
Step 2) global semantic coding: encoding a global semantic representation of the word using a multi-headed self-attention mechanism based on the local context semantic representation of the word encoded in step 1).
Step 2.1, in order to learn more diverse global semantic representations with the self-attention mechanism, this step uses a fully connected layer to map the local context semantic representation sequence H = {h_1, h_2, ..., h_N} of the words encoded in step 1 to M different feature subspaces. The mapping for the i-th feature subspace is:
Q_i = H W_i^Q   (5)
K_i = H W_i^K   (6)
V_i = H W_i^V   (7)
where W_i^Q, W_i^K and W_i^V are model parameters; Q denotes the query in the attention mechanism, K the keys, and V the values corresponding to the keys.
For example, starting from the context semantic representation sequence H encoded in step 1, the transformation of the fully connected layer yields the query Q_i, keys K_i and values V_i required by the attention mechanism in the i-th feature subspace.
Step 2.2, in the different feature subspaces obtained in step 2.1, a dot-product-based self-attention mechanism is used to encode the semantic representation of the words:
head_i = Attention(Q_i, K_i, V_i)   (8)
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V   (9)
where d_k denotes the feature dimension of the subspace and T denotes the matrix transpose.
For example, in the i-th feature subspace, the attention mechanism encodes a semantic representation head_i for every word in the sequence.
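A sketch of the scaled dot-product self-attention of equations (8)–(9) is given below; the function name and tensor shapes are assumptions.

import math
import torch

def scaled_dot_product_attention(Q, K, V):
    """head = softmax(Q K^T / sqrt(d_k)) V  (step 2.2, sketch).
    Q, K, V: tensors whose last two dimensions are (N, d_k)."""
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)  # (..., N, N)
    weights = torch.softmax(scores, dim=-1)                         # attention over all words
    return torch.matmul(weights, V)                                 # (..., N, d_k)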
Step 2.3, the semantic representations head_i in the feature subspaces calculated in step 2.2 are concatenated, and the concatenation result is fed into a fully connected layer to obtain the global semantic representation sequence Z corresponding to the words:
Z = [head_1; head_2; ...; head_M] W_z   (10)
where W_z is a model parameter.
For example, through concatenation and the fully connected layer, the global semantic representation sequence Z is obtained; the global semantic representation of the 4th word "Tianjin" is z_4 = [0.14, 0.09, ..., -0.26].
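The following sketch assembles steps 2.1–2.3 into a multi-head self-attention layer. It assumes the scaled_dot_product_attention function from the previous sketch is in scope, and it implements the M per-subspace projections W_i^Q, W_i^K, W_i^V as single stacked linear layers, a common and equivalent implementation choice; all layer sizes are illustrative assumptions.

import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Steps 2.1-2.3 (sketch): project H into M subspaces, attend, concatenate, project."""
    def __init__(self, model_dim=200, n_heads=8):
        super().__init__()
        assert model_dim % n_heads == 0
        self.n_heads, self.d_k = n_heads, model_dim // n_heads
        self.W_q = nn.Linear(model_dim, model_dim, bias=False)   # stacked per-subspace Q projections
        self.W_k = nn.Linear(model_dim, model_dim, bias=False)
        self.W_v = nn.Linear(model_dim, model_dim, bias=False)
        self.W_z = nn.Linear(model_dim, model_dim, bias=False)   # output projection (eq. 10)

    def _split(self, x):                       # (batch, N, model_dim) -> (batch, heads, N, d_k)
        b, n, _ = x.shape
        return x.view(b, n, self.n_heads, self.d_k).transpose(1, 2)

    def forward(self, H):                      # H: (batch, N, model_dim) from the BLSTM
        Q, K, V = self._split(self.W_q(H)), self._split(self.W_k(H)), self._split(self.W_v(H))
        heads = scaled_dot_product_attention(Q, K, V)               # (batch, heads, N, d_k)
        b, _, n, _ = heads.shape
        concat = heads.transpose(1, 2).contiguous().view(b, n, -1)  # [head_1; ...; head_M]
        return self.W_z(concat)                                     # global representation Z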
Step 3, semantic feature fusion: although the attention mechanism is not limited by distance when modeling semantic or syntactic dependencies and can make up for the weakness of the BLSTM in long-distance semantic modeling, it is an order-insensitive computation, so word-order information in the sequence may be lost during modeling. This step therefore constructs three feature fusion modes to fuse the local semantic features H learned by the BLSTM in step 1 with the global semantic features Z learned by the multi-head self-attention mechanism in step 2, achieving complementary advantages; the fusion result U is used as the input semantic feature of step 4.
Step 3.1, the one-dimensional parameter fusion method: for the i-th word in the text, the corresponding local context semantic representation h_i and global semantic representation z_i are first concatenated, the result is mapped to a one-dimensional space by a fully connected layer, and the sigmoid activation function yields the fusion weight β_i:
β_i = sigmoid(W_β [h_i; z_i])   (11)
The semantic representation after one-dimensional parameter fusion is:
u_i = (1 - β_i) · h_i + β_i · z_i   (12)
where W_β is a model parameter.
For example, the local context semantic representation of the 4th word "Tianjin" in the word sequence is h_4 = [0.02, 0.11, ..., 0.76] and its global semantic representation is z_4 = [0.14, 0.09, ..., -0.26]. The calculation gives β_4 = 0.4, so the fused semantic representation is u_4 = [0.07, 0.10, ..., 0.35].
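A sketch of the one-dimensional (scalar-gate) fusion of equations (11)–(12); the class name and feature dimension are assumptions.

import torch
import torch.nn as nn

class ScalarGateFusion(nn.Module):
    """Step 3.1 (sketch): u_i = (1 - beta_i) * h_i + beta_i * z_i, with a scalar gate per word."""
    def __init__(self, feat_dim=200):
        super().__init__()
        self.W_beta = nn.Linear(2 * feat_dim, 1)

    def forward(self, H, Z):                                          # H, Z: (batch, N, feat_dim)
        beta = torch.sigmoid(self.W_beta(torch.cat([H, Z], dim=-1)))  # (batch, N, 1)
        return (1.0 - beta) * H + beta * Z                            # broadcast the scalar gate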
Step 3.2, the multi-dimensional parameter fusion method: drawing on the gating mechanism in the LSTM, for the i-th word in the text, the corresponding local semantic representation h_i and global semantic representation z_i are first concatenated, the result is mapped by a fully connected layer to a weight space with the same dimension as the semantic representation, and the sigmoid activation function yields the fusion weight vector α_i:
α_i = sigmoid(W_α [h_i; z_i])   (13)
where W_α is a model parameter. The local semantics and the global semantics are then fused by element-wise multiplication; the semantic representation after multi-dimensional parameter fusion is:
u_i = (I - α_i) ⊙ h_i + α_i ⊙ z_i   (14)
where ⊙ denotes element-wise multiplication and I denotes a column vector whose elements are all 1.
For example, the local context semantic representation of the 4th word "Tianjin" in the word sequence is h_4 = [0.02, 0.11, ..., 0.76] and its global semantic representation is z_4 = [0.14, 0.09, ..., -0.26]. The calculation gives α_4 = [0.31, 0.1, ..., 0.4], so the fused semantic representation is u_4 = [0.06, 0.11, ..., 0.35].
Step 3.3, the free-weight semantic fusion method: specifically, this step randomly initializes two trainable parameters γ and δ and uses them to fuse the semantic features:
u_i = γ_i ⊙ h_i + δ_i ⊙ z_i   (15)
For example, the local context semantic representation of the 4th word "Tianjin" in the word sequence is h_4 = [0.02, 0.11, ..., 0.76] and its global semantic representation is z_4 = [0.14, 0.09, ..., -0.26]. After model optimization, γ_4 = [0.19, 0.52, ..., -0.11] and δ_4 = [-0.22, 0.98, ..., 0.17], so the fused semantic representation is u_4 = [-0.03, 0.15, ..., 0.13].
Step 4), sequence labeling: in order to fully consider the dependency relationship among the labels in the sequence labeling task, the CRF is utilized to predict the labels.
Step 4.1, the fused semantic feature sequence U = {u_1, u_2, ..., u_N} obtained in step 3 (in practical application, one fusion mode from step 3 is selected and its result is used as the input of step 4) is transformed by a fully connected layer to obtain the state feature matrix P, which represents the association between the semantics of each word and the labels:
P = U W_p + b_p   (16)
where W_p and b_p are model parameters.
For example, for the 4th word "Tianjin" in the word sequence, its state feature may be p_4 = [0.01, 0.91, ..., 0.00].
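Equation (16) is a single fully connected layer; a minimal sketch, where the feature dimension and label-set size are illustrative assumptions:

import torch.nn as nn

# Step 4.1 (sketch): map fused features U (batch, N, feat_dim) to per-label scores P (batch, N, n_tags)
state_projection = nn.Linear(200, 9)   # feat_dim=200 and n_tags=9 are assumed values
# P = state_projection(U)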
Step 4.2, a transfer feature matrix A is randomly initialized to represent the transition relations between labels; this matrix is optimized through loss back-propagation during model training. For example, A may be a randomly initialized square matrix whose rows and columns both index the label set.
Step 4.3, based on the state feature matrix obtained in step 4.1 and the transfer feature matrix obtained in step 4.2, the score of any possible label sequence Ŷ = {ŷ_1, ŷ_2, ..., ŷ_N} is calculated as:
score(X, Ŷ) = Σ_i A_{ŷ_i, ŷ_{i+1}} + Σ_i P_{i, ŷ_i}   (17)
and, based on the score, the probability corresponding to the label sequence is calculated as:
P(Ŷ | X) = exp(score(X, Ŷ)) / Σ_{Y′} exp(score(X, Y′))   (18)
For example, for the named entity recognition task, the word sequence {"I", "yesterday", "in", "Tianjin", "participating in", "having", "a", "marathon", "race"} corresponds to a label sequence (for example, one in which "Tianjin" is labeled as a location entity), and the probability corresponding to this label sequence is 0.9.
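The sketch below computes the score of equation (17) and the log-probability of equation (18) for a linear-chain CRF over a single sentence; the forward-algorithm normalizer, function names and the omission of start/stop transitions are assumptions consistent with standard CRF practice rather than details taken from the patent.

import torch

def crf_score(P, A, tags):
    """Equation (17), sketch: emission scores P[i, y_i] plus transition scores A[y_i, y_{i+1}].
    P: (N, n_tags) state feature matrix for one sentence; A: (n_tags, n_tags); tags: (N,) tag ids.
    Start/stop transitions are omitted for brevity."""
    emit = P[torch.arange(P.size(0)), tags].sum()
    trans = A[tags[:-1], tags[1:]].sum()
    return emit + trans

def crf_log_partition(P, A):
    """log of the sum over all tag sequences of exp(score), via the forward algorithm (sketch)."""
    alpha = P[0]                                            # (n_tags,)
    for i in range(1, P.size(0)):
        # alpha_new[j] = log sum_k exp(alpha[k] + A[k, j] + P[i, j])
        alpha = torch.logsumexp(alpha.unsqueeze(1) + A + P[i].unsqueeze(0), dim=0)
    return torch.logsumexp(alpha, dim=0)

def crf_log_prob(P, A, tags):
    """Equation (18) in log space: log P(Y|X) = score(X, Y) - log Z(X)."""
    return crf_score(P, A, tags) - crf_log_partition(P, A)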
Step 5, model training: during model training, the invention maximizes the probability P(Y | X) corresponding to the standard label sequence; accordingly, the parameters in steps 1 to 4 are optimized by minimizing the following negative log-likelihood function:
L = -log P(Y | X) = -score(X, Y) + log Σ_{Y′} exp(score(X, Y′))   (19)
Step 6, model inference: in practical application, the invention uses the Viterbi algorithm to search for the optimal label sequence:
Y* = argmax_{Y′} score(X, Y′)   (20)
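A sketch of the training objective of equation (19) and the Viterbi decoding of equation (20), reusing the CRF helpers sketched above (crf_log_prob is assumed to be in scope); function names are assumptions.

import torch

def crf_nll(P, A, gold_tags):
    """Equation (19), sketch: negative log-likelihood of the gold tag sequence, minimized by back-propagation."""
    return -crf_log_prob(P, A, gold_tags)

def viterbi_decode(P, A):
    """Equation (20), sketch: return the highest-scoring tag sequence for one sentence."""
    N, n_tags = P.shape
    score = P[0]                                             # best score ending in each tag at position 0
    backptr = []
    for i in range(1, N):
        total = score.unsqueeze(1) + A + P[i].unsqueeze(0)   # (n_tags_prev, n_tags_cur)
        score, best_prev = total.max(dim=0)                  # best incoming tag for each current tag
        backptr.append(best_prev)
    best_last = int(score.argmax())
    path = [best_last]
    for bp in reversed(backptr):                             # follow back-pointers to recover the path
        path.append(int(bp[path[-1]]))
    return list(reversed(path))                              # optimal tag id sequence Y*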
the foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A sequence labeling method based on a multi-head self-attention mechanism, characterized by comprising the following steps executed in sequence:
step 1, local context semantic coding, namely using a BLSTM to sequentially learn the local context semantic representation of the words in a text:
step 1.1, performing word segmentation on an input text to obtain a corresponding word sequence;
step 1.2, for each word in the word sequence, coding character-level vector representation corresponding to each word by using a BLSTM structure;
step 1.3, for each word in the word sequence, splicing the character-level vector representation and the word embedding vector representation coded in the step 1.2 to serve as word initial semantic representation;
step 1.4, based on the word initial semantic representation obtained in step 1.3, using BLSTM to encode the local context semantic representation of each word;
step 2, global semantic coding, namely coding the global semantic representation of the words by utilizing a multi-head self-attention mechanism based on the local context semantic representation of the words coded in the step 1:
step 2.1, mapping the local context semantic representation of the words coded in step 1 to a plurality of different feature subspaces by adopting a full connection layer;
step 2.2, under different feature subspaces obtained in step 2.1, utilizing a self-attention mechanism to encode semantic representation of words;
step 2.3, the semantic representations of the words in each feature subspace calculated in the step 2.2 are spliced, and the splicing result is input into a full connection layer to obtain the global semantic representation corresponding to each word;
step 3, semantic feature fusion, namely constructing the following three feature fusion modes, fusing the local context semantic representation encoded in step 1 with the global semantic representation encoded in step 2, and taking the fusion result as the input semantic feature of step 4:
step 3.1, constructing a one-dimensional parameter fusion method to realize the linear combination of local context semantics and global semantics;
step 3.2, constructing a multi-dimensional parameter fusion method that draws on the gating mechanism used in the LSTM;
step 3.3, constructing a free-weight semantic fusion method;
step 4, sequence labeling, namely predicting the labels with a CRF (conditional random field) in order to fully consider the dependency relationships among the labels in the sequence labeling task:
step 4.1, performing full-connection transformation on the fused semantic feature sequence obtained in the step 3 to obtain a state feature matrix, and representing the association between the semantics of each word and the label;
step 4.2, initializing a transfer characteristic matrix randomly to express the transfer relation between the labels;
step 4.3, calculating the corresponding score and probability of any possible label sequence based on the state characteristic matrix obtained in the step 4.1 and the transfer characteristic matrix obtained in the step 4.2;
step 5, model training: in the model training process, optimizing the parameters in the steps 1 to 4 by adopting the probability corresponding to the maximized standard label sequence;
step 6, model reasoning: in the practical application process, the optimal label sequence is searched by adopting a Viterbi algorithm, and model reasoning is carried out.
2. The method for labeling sequences based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 1.1, the Stanford NLP toolkit is used to perform word segmentation on the input text.
3. The sequence labeling method based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 1.3, the initial semantic representation of the word is e_i = [c_i; w_i], where c_i is the character-level vector representation and w_i is the word embedding vector representation.
4. The sequence labeling method based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 1.4, based on the initial semantic representation sequence E = {e_1, e_2, ..., e_N} of the words obtained in step 1.3, a BLSTM is used to encode the local context semantic representation h_i of each word x_i in the text:
h_i^fw = LSTM_fw(e_i, h_{i-1}^fw)
h_i^bw = LSTM_bw(e_i, h_{i+1}^bw)
h_i = [h_i^fw; h_i^bw]
5. The sequence labeling method based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 2.1, the word local context semantic representation sequence H = {h_1, h_2, ..., h_N} encoded in step 1 is mapped to M different feature subspaces, and the mapping for the i-th feature subspace is:
Q_i = H W_i^Q
K_i = H W_i^K
V_i = H W_i^V
where W_i^Q, W_i^K and W_i^V are model parameters; Q denotes the query in the attention mechanism, K the keys, and V the values corresponding to the keys.
6. The sequence labeling method based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 2.2, in the different feature subspaces obtained in step 2.1, a dot-product-based self-attention mechanism is used to encode the semantic representation of the words:
head_i = Attention(Q_i, K_i, V_i)
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
where d_k denotes the feature dimension of the subspace and T denotes the matrix transpose.
7. The sequence labeling method based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 2.3, the semantic representations head_i in the feature subspaces calculated in step 2.2 are concatenated, and the concatenation result is fed into a fully connected layer to obtain the global semantic representation sequence Z corresponding to the words:
Z = [head_1; head_2; ...; head_M] W_z
where W_z is a model parameter.
8. The sequence labeling method based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 3, the semantic representation after one-dimensional parameter fusion is:
u_i = (1 - β_i) · h_i + β_i · z_i
where β_i = sigmoid(W_β [h_i; z_i]), h_i is the local context semantic representation, z_i is the global semantic representation, and W_β is a model parameter;
the semantic representation after multi-dimensional parameter fusion is:
u_i = (I - α_i) ⊙ h_i + α_i ⊙ z_i
where α_i = sigmoid(W_α [h_i; z_i]) and W_α is a model parameter;
the semantic representation of the free-weight semantic fusion is:
u_i = γ_i ⊙ h_i + δ_i ⊙ z_i
where γ and δ are two trainable parameters.
9. The sequence labeling method based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 4.1, the fused semantic feature sequence U = {u_1, u_2, ..., u_N} obtained in step 3 is transformed by a fully connected layer to obtain the state feature matrix P, which represents the association between the semantics of each word and the labels:
P = U W_p + b_p
where W_p and b_p are model parameters.
10. The sequence labeling method based on the multi-head self-attention mechanism as claimed in claim 1, wherein in step 4.3, based on the state feature matrix obtained in step 4.1 and the transfer feature matrix obtained in step 4.2, the score of any possible label sequence Ŷ = {ŷ_1, ŷ_2, ..., ŷ_N} is calculated as:
score(X, Ŷ) = Σ_i A_{ŷ_i, ŷ_{i+1}} + Σ_i P_{i, ŷ_i}
and, based on the score, the probability corresponding to the label sequence is calculated as:
P(Ŷ | X) = exp(score(X, Ŷ)) / Σ_{Y′} exp(score(X, Y′))
wherein, in the model training process, the probability P(Y | X) corresponding to the standard label sequence is maximized, and the parameters in steps 1 to 4 are optimized by minimizing the following negative log-likelihood function:
L = -log P(Y | X) = -score(X, Y) + log Σ_{Y′} exp(score(X, Y′))
and in step 6, the optimal label sequence is searched with the Viterbi algorithm:
Y* = argmax_{Y′} score(X, Y′)
CN202011187198.0A 2020-10-29 2020-10-29 Sequence labeling method based on multi-head self-attention mechanism Pending CN112380863A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011187198.0A CN112380863A (en) 2020-10-29 2020-10-29 Sequence labeling method based on multi-head self-attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011187198.0A CN112380863A (en) 2020-10-29 2020-10-29 Sequence labeling method based on multi-head self-attention mechanism

Publications (1)

Publication Number Publication Date
CN112380863A true CN112380863A (en) 2021-02-19

Family

ID=74576393

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011187198.0A Pending CN112380863A (en) 2020-10-29 2020-10-29 Sequence labeling method based on multi-head self-attention mechanism

Country Status (1)

Country Link
CN (1) CN112380863A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112967112A (en) * 2021-03-24 2021-06-15 武汉大学 Electronic commerce recommendation method for self-attention mechanism and graph neural network
CN112990434A (en) * 2021-03-09 2021-06-18 平安科技(深圳)有限公司 Training method of machine translation model and related device
CN113010685A (en) * 2021-02-23 2021-06-22 安徽科大讯飞医疗信息技术有限公司 Medical term standardization method, electronic device, and storage medium
CN113158051A (en) * 2021-04-23 2021-07-23 山东大学 Label sorting method based on information propagation and multilayer context information modeling
CN113240098A (en) * 2021-06-16 2021-08-10 湖北工业大学 Fault prediction method and device based on hybrid gated neural network and storage medium
CN113378243A (en) * 2021-07-14 2021-09-10 南京信息工程大学 Personalized federal learning method based on multi-head attention mechanism
CN114462406A (en) * 2022-03-01 2022-05-10 中国航空综合技术研究所 Method for acquiring first-appearing aviation keywords based on multi-head self-attention model
CN115796173A (en) * 2023-02-20 2023-03-14 杭银消费金融股份有限公司 Data processing method and system for supervision submission requirements

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697285A (en) * 2018-12-13 2019-04-30 中南大学 Enhance the hierarchical B iLSTM Chinese electronic health record disease code mask method of semantic expressiveness
CN110457480A (en) * 2019-08-16 2019-11-15 国网天津市电力公司 The construction method of fine granularity sentiment classification model based on interactive attention mechanism
CN111274398A (en) * 2020-01-20 2020-06-12 福州大学 Method and system for analyzing comment emotion of aspect-level user product
CN111767409A (en) * 2020-06-14 2020-10-13 南开大学 Entity relationship extraction method based on multi-head self-attention mechanism
CN111783394A (en) * 2020-08-11 2020-10-16 深圳市北科瑞声科技股份有限公司 Training method of event extraction model, event extraction method, system and equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109697285A (en) * 2018-12-13 2019-04-30 中南大学 Enhance the hierarchical B iLSTM Chinese electronic health record disease code mask method of semantic expressiveness
CN110457480A (en) * 2019-08-16 2019-11-15 国网天津市电力公司 The construction method of fine granularity sentiment classification model based on interactive attention mechanism
CN111274398A (en) * 2020-01-20 2020-06-12 福州大学 Method and system for analyzing comment emotion of aspect-level user product
CN111767409A (en) * 2020-06-14 2020-10-13 南开大学 Entity relationship extraction method based on multi-head self-attention mechanism
CN111783394A (en) * 2020-08-11 2020-10-16 深圳市北科瑞声科技股份有限公司 Training method of event extraction model, event extraction method, system and equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhang Zhichang et al.: "Health question classification fusing local semantic and global structural information", Journal of Xidian University *
Wang Xuqiang et al.: "Feature fusion sequence labeling model based on an attention mechanism", HTTPS://KNS.CNKI.NET/KCMS/DETAIL/37.1357.N.20200619.1603.002.HTML *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113010685B (en) * 2021-02-23 2022-12-06 安徽讯飞医疗股份有限公司 Medical term standardization method, electronic device, and storage medium
CN113010685A (en) * 2021-02-23 2021-06-22 安徽科大讯飞医疗信息技术有限公司 Medical term standardization method, electronic device, and storage medium
CN112990434A (en) * 2021-03-09 2021-06-18 平安科技(深圳)有限公司 Training method of machine translation model and related device
CN112990434B (en) * 2021-03-09 2023-06-20 平安科技(深圳)有限公司 Training method of machine translation model and related device
CN112967112B (en) * 2021-03-24 2022-04-29 武汉大学 Electronic commerce recommendation method for self-attention mechanism and graph neural network
CN112967112A (en) * 2021-03-24 2021-06-15 武汉大学 Electronic commerce recommendation method for self-attention mechanism and graph neural network
CN113158051B (en) * 2021-04-23 2022-11-18 山东大学 Label sorting method based on information propagation and multilayer context information modeling
CN113158051A (en) * 2021-04-23 2021-07-23 山东大学 Label sorting method based on information propagation and multilayer context information modeling
CN113240098A (en) * 2021-06-16 2021-08-10 湖北工业大学 Fault prediction method and device based on hybrid gated neural network and storage medium
CN113378243A (en) * 2021-07-14 2021-09-10 南京信息工程大学 Personalized federal learning method based on multi-head attention mechanism
CN113378243B (en) * 2021-07-14 2023-09-29 南京信息工程大学 Personalized federal learning method based on multi-head attention mechanism
CN114462406A (en) * 2022-03-01 2022-05-10 中国航空综合技术研究所 Method for acquiring first-appearing aviation keywords based on multi-head self-attention model
CN115796173A (en) * 2023-02-20 2023-03-14 杭银消费金融股份有限公司 Data processing method and system for supervision submission requirements

Similar Documents

Publication Publication Date Title
CN112380863A (en) Sequence labeling method based on multi-head self-attention mechanism
CN109992782B (en) Legal document named entity identification method and device and computer equipment
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN106502985B (en) neural network modeling method and device for generating titles
Yao et al. Bi-directional LSTM recurrent neural network for Chinese word segmentation
CN111666427B (en) Entity relationship joint extraction method, device, equipment and medium
CN111767409B (en) Entity relationship extraction method based on multi-head self-attention mechanism
CN111401084B (en) Method and device for machine translation and computer readable storage medium
CN112541356B (en) Method and system for recognizing biomedical named entities
CN114298053B (en) Event joint extraction system based on feature and attention mechanism fusion
CN112100332A (en) Word embedding expression learning method and device and text recall method and device
CN113987169A (en) Text abstract generation method, device and equipment based on semantic block and storage medium
CN111368542A (en) Text language association extraction method and system based on recurrent neural network
CN111881256B (en) Text entity relation extraction method and device and computer readable storage medium equipment
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN110874536A (en) Corpus quality evaluation model generation method and bilingual sentence pair inter-translation quality evaluation method
CN112163089A (en) Military high-technology text classification method and system fusing named entity recognition
CN116932722A (en) Cross-modal data fusion-based medical visual question-answering method and system
Xu et al. Match-prompt: Improving multi-task generalization ability for neural text matching via prompt learning
CN115169349A (en) Chinese electronic resume named entity recognition method based on ALBERT
CN113076718B (en) Commodity attribute extraction method and system
CN115019142A (en) Image title generation method and system based on fusion features and electronic equipment
CN116680575B (en) Model processing method, device, equipment and storage medium
CN113901218A (en) Inspection business basic rule extraction method and device
CN112633007A (en) Semantic understanding model construction method and device and semantic understanding method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210219

RJ01 Rejection of invention patent application after publication