CN112651245A - Sequence annotation model and sequence annotation method - Google Patents
Sequence annotation model and sequence annotation method
- Publication number
- CN112651245A (application number CN202011577267.9A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- model
- vector
- lstm
- elmo
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention provides a sequence labeling model and a sequence labeling method. When the model performs a sequence labeling task, ELMo word vectors are first added to the input layer as additional features, so that the representation of each character is the concatenation of its character vector and its ELMo representation. Second, in the BiLSTM network layer, in addition to a forward LSTM network that learns the historical features of each character, the sequence is also fed in reverse order into a backward LSTM network that learns the subsequent features of each character; the context features of the characters are concatenated and fed into a CRF layer. Finally, a conditional random field performs joint modeling to obtain the globally optimal label sequence. The method achieves good performance on Chinese named entity recognition datasets such as Boson and LDC2009, improving the average F1 value by 4.95%.
Description
Technical Field
The invention relates to a sequence labeling model and a sequence labeling method, and belongs to the field of computer application and natural language processing.
Background
Named Entity Recognition (NER), also called "proper name recognition", refers to identifying entities with specific meaning in text, mainly including names of people, places and organizations, proper nouns, and the like. Early named entity recognition was mostly rule-based, but because language structure itself is uncertain, formulating unified and complete rules is difficult. The rule-based approach requires constructing specific rule templates; the features used include statistical information, punctuation marks, keywords, position words, head words and the like, with pattern and string matching as the main means, and it depends heavily on the construction of knowledge bases and dictionaries. For different domains, experts must rewrite the rules, which is costly; rule construction takes a long time, portability is poor, and domain-specific knowledge bases must be built as aids to improve the system's recognition ability.
Traditional named entity recognition methods mostly adopt supervised machine learning models, such as hidden Markov models, maximum entropy, support vector machines and conditional random fields. The maximum entropy model is well structured and general, but its training is computationally complex, and explicit normalization must be computed at considerable cost. Conditional random fields perform well in word segmentation and named entity recognition and provide a labeling framework with flexible features and global optimization, but suffer from slow convergence and long training times. Statistics-based methods depend heavily on feature selection: the features that most influence the task must be analyzed, selected from the text and added to a feature template; effective features must be chosen by statistically analyzing the linguistic and semantic information contained in the training corpus, and strong features must be continually discovered in it. In word2vec (2013) and GloVe (2014), each word corresponds to a single vector, which cannot represent polysemous words.
In view of the above, it is necessary to provide a sequence annotation model to solve the above problems.
Disclosure of Invention
The invention aims to provide a sequence annotation model for improving the performance of named entity recognition.
In order to achieve the above object, the present invention provides a sequence annotation model, wherein the sequence annotation model adopts the basic framework of the BiLSTM-CRF model and adds word vectors of the pre-trained language model ELMo as additional features, and the sequence annotation model comprises:
an input layer: for receiving an input sentence consisting of n characters (w_1 w_2 … w_n), mapping each character in the sentence into a vector sequence by querying a word-vector table, and introducing the ELMo word vector as an additional feature into the input layer;
a BiLSTM network layer: comprising a forward long short-term memory network (LSTM) and a backward LSTM, wherein the forward LSTM computes the representation of the sequence from front to back and the backward LSTM computes the representation of the same sequence from back to front; the BiLSTM network layer receives the vector sequence obtained from the input layer as input so as to obtain the context features of each character;
a CRF layer: for receiving the output of the BiLSTM network layer and introducing a transition matrix for globally optimal decoding of the vector sequence of the entire sentence.
As a further refinement of the present invention, the sentences input into the input layer contain a variable number of characters.
As a further improvement of the present invention, the language model ELMo includes an English ELMo model and a Chinese ELMo model. In the English ELMo model, the representation of an English word combines the word's own vector with a representation obtained by convolving its characters through a convolutional neural network; in the Chinese ELMo model, the representation of each character is encoded directly.
As a further improvement of the invention, the language model ELMo is a bidirectional LSTM language model comprising a forward language model LSTM and a backward language model LSTM; after pre-training the bidirectional LSTM language model, the language model ELMo uses as the word representation the set R_k = {x_k^LM, h→_{k,j}^LM, h←_{k,j}^LM | j = 1, …, L}, where x_k^LM is the vector of the character itself, and h→_{k,j}^LM and h←_{k,j}^LM are the outputs of the forward language model LSTM and the backward language model LSTM respectively, yielding 2L+1 representations.
Another object of the invention is to provide a sequence labeling method for better implementing the above sequence labeling model.
In order to achieve the above object, the present invention provides a sequence labeling method, which is applied to the above sequence labeling model, and specifically includes the following steps:
step 1, a characteristic representation stage: the input layer sets (w) the input sentence content through a random word vector or a word vector table initialized with a pre-trained word vector1w2...wn) Character w inkMapping to a sequence of vectors (v)1v2...vn) While introducing the word vector of ELMo as an additional feature, so that the character representation of each character is a concatenation of its character vector and the word vector representation of ELMo, i.e., wt=[vt,et],t∈[1,n]Then the vector sequence of the whole sentence input is (v)1v2...vn);
Step 2, the encoding stage: for the vector sequence (v_1 v_2 … v_n) output by the feature representation stage, the forward language model LSTM obtains the probability of w_k given (w_1 w_2 … w_{k-1}); the probability of the vector sequence of the whole sentence is then obtained through the probability formula, the objective function is obtained by combining this with the formula of the backward language model LSTM, and the bidirectional LSTM language model then performs the encoding to obtain the feature vector h_t of each character in the whole sentence;
Step 3, the decoding stage: a transition matrix A_ij is defined to represent the score of moving from label i to label j; based on the feature vector h_t of each character, the score of each label is computed as o_t = W_o h_t + b_o; the score of the vector sequence (v_1 v_2 … v_n) of the whole sentence is then computed, and the globally most likely label sequence is selected with reference to the transition matrix A_ij.
As a further improvement of the present invention, the probability formula in step 2 is:
As a further improvement of the present invention, the formula of the backward language model LSTM in step 2 is: p(w_1, w_2, …, w_n) = ∏_{k=1}^{n} p(w_k | w_{k+1}, w_{k+2}, …, w_n).
As a further improvement of the present invention, the objective function in step 2 is: max Σ_{k=1}^{n} [ log p(w_k | w_1, …, w_{k-1}; Θ_x, Θ→_LSTM, Θ_s) + log p(w_k | w_{k+1}, …, w_n; Θ_x, Θ←_LSTM, Θ_s) ],
wherein Θ_x and Θ_s denote the parameters of the ELMo word-vector (character convolutional neural network) representation and of the softmax layer, respectively.
As a further improvement of the present invention, the formulas used by the bidirectional LSTM language model encoding in step 2 include:
i_t = σ(W_ii x_t + b_ii + W_hi h_{t-1} + b_hi)
f_t = σ(W_if x_t + b_if + W_hf h_{t-1} + b_hf)
g_t = tanh(W_ig x_t + b_ig + W_hg h_{t-1} + b_hg)
o_t = σ(W_io x_t + b_io + W_ho h_{t-1} + b_ho)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)
where W and b are parameters of the bidirectional LSTM language model, σ is the sigmoid function, ⊙ is element-wise multiplication, i_t, f_t, o_t denote the input, forget and output gates at time t, and c_t, h_t, g_t denote the cell state, output state and new candidate state at time t.
As a further improvement of the invention, the score of the vector sequence (v_1 v_2 … v_n) of the whole sentence in step 3 is: s(X, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} o_{i, y_i}. Defining y_0 and y_{n+1} as the beginning and ending tags of the sentence, the probability of a tag sequence for the whole sentence is: p(y | X) = exp(s(X, y)) / Σ_{y′ ∈ Y_X} exp(s(X, y′)).
the log probability of maximizing the correct tag sequence is:
wherein Y_X represents all possible tag sequences; the most likely tag sequence is output at prediction time through y* = argmax_{y′ ∈ Y_X} s(X, y′).
The beneficial effects of the invention are: the sequence labeling model accelerates the convergence of named entity recognition, shortens the training time, improves the performance of the sequence labeling model for named entity recognition, and improves the accuracy of named entity recognition.
Drawings
FIG. 1 is a schematic structural diagram of a sequence annotation model according to the present invention.
Fig. 2 is a schematic structural view of the ELMo model in fig. 1.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
The invention provides a sequence annotation model oriented to fine-grained named entity recognition. The pre-trained language model ELMo is added as a feature to the BiLSTM-CRF basic framework for named entity recognition; on the basis of large-scale unsupervised data, pre-trained context-based word vectors improve the performance of the named entity recognition model, and combining natural language processing and machine learning methods improves the accuracy of named entity recognition.
As shown in fig. 1, the entire model includes: an input layer, a BiLSTM network layer, and a CRF layer.
An input layer: for receiving an input sentence consisting of n characters (w_1 w_2 … w_n), mapping each character in the sentence into a vector sequence by querying the word-vector table, and introducing the ELMo word vector as an additional feature into the input layer. The sentences input into the input layer may contain a variable number of characters. Specifically, the characters contained in the sentence are mapped into a vector sequence through the word-vector table, ELMo word vectors are added at the input layer, and the representation of each character, the concatenation of its word vector and its ELMo representation, is fed into the BiLSTM layer.
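As a minimal sketch of this input-layer concatenation (toy dimensions and randomly initialized tables, purely for illustration; real character vectors and ELMo vectors are far larger), the mapping w_t = [v_t, e_t] can be written as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes: 5-character vocabulary, 4-dim character vectors,
# 3-dim ELMo vectors (real models use far larger dimensions).
char_table = rng.normal(size=(5, 4))  # word-vector table, one row per character
elmo_vecs = rng.normal(size=(3, 3))   # ELMo vectors for one 3-character sentence

sentence = [2, 0, 4]  # character indices (w_1 w_2 w_3)

# Each character representation is the concatenation w_t = [v_t, e_t].
inputs = np.stack([
    np.concatenate([char_table[c], elmo_vecs[t]])
    for t, c in enumerate(sentence)
])

print(inputs.shape)  # (3, 7): n characters, char-dim + ELMo-dim
```

The resulting (n, d_char + d_elmo) matrix is what the BiLSTM layer receives as input.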
A BiLSTM network layer (Bi-directional Long Short-Term Memory): a bidirectional long short-term memory neural network layer comprising a forward LSTM and a backward LSTM; the forward LSTM computes the representation of the sequence from front to back, the backward LSTM computes the representation of the same sequence from back to front, and the BiLSTM network layer receives the vector sequence obtained at the input layer as input so as to obtain the context features of each character. Specifically, the LSTM model at time t is composed of the input word w_t, the cell state c_t, the temporary (candidate) cell state g_t, the hidden state h_t, the forget gate f_t, the input (memory) gate i_t and the output gate o_t. The computation of the LSTM can be summarized as follows: by forgetting old information and memorizing new information in the cell state, information useful for later time steps is passed on and useless information is discarded, and the hidden state h_t is output at each time step; the forgetting, memorizing and outputting are controlled by the forget gate f_t, the input gate i_t and the output gate o_t, which are computed from the hidden state h_{t-1} of the previous time step and the current input w_t.
A CRF layer: for receiving the output of the BiLSTM network layer and introducing a transition matrix for globally optimal decoding of the vector sequence of the entire sentence. In particular, the CRF layer can add constraints on the final predicted tags to ensure that they are legal; these constraints are learned automatically by the CRF layer during training. The CRF layer receives the output scores of the BiLSTM layer as input, adds a transition score matrix, and selects the globally optimal label sequence according to the scores.
The language model ELMo includes an English ELMo model and a Chinese ELMo model. In the English ELMo model, the representation of an English word combines the word's own vector with a representation obtained by convolving its characters through a CNN; in the Chinese ELMo model, the representation of each character is encoded directly.
As shown in fig. 2, the language model ELMo is a bidirectional LSTM language model comprising a forward language model LSTM and a backward language model LSTM, and the objective function is the joint maximum likelihood of the language models in both directions. After pre-training the bidirectional LSTM language model, ELMo uses as the word representation a combination E_k = γ Σ_{j=0}^{L} s_j h_{k,j}^LM, a weighted summation over the intermediate layers of the bidirectional language model; in the simplest case, the representation of the highest layer alone can be used as ELMo. When a supervised NLP task is then performed, ELMo is concatenated as a feature to the word-vector input of the task-specific model, or to the representation at the model's highest layer. Unlike traditional word vectors, where each word corresponds to only one vector, ELMo uses the pre-trained bidirectional language model to obtain a context-dependent representation of the current word from the concrete input; that is, representations of the same word differ in different contexts, and these representations are then added as features into the specific supervised NLP model.
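The layer combination E_k = γ Σ_j s_j h_{k,j} can be sketched in numpy as follows (a toy illustration with made-up layer vectors and L = 1; the function name `elmo_combine` is an assumption for the example, not part of any library):

```python
import numpy as np

def elmo_combine(layer_reps, s_weights, gamma=1.0):
    """Collapse the 2L+1 per-character layer representations into one
    vector: E_k = gamma * sum_j softmax(s)_j * h_{k,j}."""
    s = np.exp(s_weights - s_weights.max())
    s /= s.sum()                                 # softmax over the layers
    return gamma * np.einsum('j,jd->d', s, layer_reps)

# For L = 1 there are 2L+1 = 3 representations per character.
layers = np.array([
    [1.0, 0.0],   # x_k: the character's own vector
    [0.0, 1.0],   # forward layer-1 LSTM state
    [1.0, 1.0],   # backward layer-1 LSTM state
])

e_k = elmo_combine(layers, np.zeros(3))  # uniform weights -> plain average
print(e_k)  # approximately [0.667, 0.667]
```

With learned, non-uniform weights the softmax lets the downstream task emphasize whichever layer is most useful.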
In order to better realize the sequence labeling model, the invention also provides a sequence labeling method, which specifically comprises the following steps:
step 1, a characteristic representation stage: the input layer sets (w) the input sentence content through a random word vector or a word vector table initialized with a pre-trained word vector1w2...wn) Is mapped to a vector sequence (v)1v2...vn) While introducing the word vector of ELMo as an additional feature, so that the character representation of each character is a concatenation of its character vector and the word vector representation of ELMo, i.e., wt=[vt,et],t∈[1,n]Then the vector sequence of the whole sentence input is (v)1v2...vn);
Step 2, the encoding stage: for the vector sequence (v_1 v_2 … v_n) output by the feature representation stage, the forward language model LSTM obtains the probability of each character w_k given (w_1 w_2 … w_{k-1}); the probability of the vector sequence of the whole sentence is then calculated through the probability formula, the objective function is obtained by combining this with the formula of the backward language model, and the bidirectional LSTM language model then performs the encoding to obtain the feature vector h_t of each character in the whole sentence;
Step 3, the decoding stage: a transition matrix A_ij is defined, representing the score from label i to label j; based on each character's feature vector h_t, the score of each label is computed as o_t = W_o h_t + b_o; the score of the whole-sentence vector sequence (v_1 v_2 … v_n) is then computed, and the globally most likely tag sequence is selected with reference to the transition matrix.
In step 2, specifically, for a sequence of n characters (w_1 w_2 … w_n), the forward language model models the probability of w_k conditioned on (w_1 w_2 … w_{k-1}), and the probability of the whole sentence sequence can be calculated as: p(w_1, w_2, …, w_n) = ∏_{k=1}^{n} p(w_k | w_1, …, w_{k-1}). The backward language model is similar to the forward one; the input sequence only needs to be reversed, i.e. each character is predicted from its following context: p(w_1, w_2, …, w_n) = ∏_{k=1}^{n} p(w_k | w_{k+1}, …, w_n).
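The two factorizations can be checked with a toy numerical sketch (the per-position conditional probabilities below are made up; a real model would produce them with the forward and backward LSTMs):

```python
import math

def sentence_log_prob(cond_probs):
    """log p(w_1..w_n) as the sum of per-position conditional log
    probabilities, matching the product formulas above."""
    return sum(math.log(p) for p in cond_probs)

# Hypothetical conditionals for one 3-character sentence.
fwd = [0.5, 0.25, 0.5]   # p(w_k | w_1..w_{k-1}) from the forward LM
bwd = [0.5, 0.5, 0.25]   # p(w_k | w_{k+1}..w_n) from the backward LM

log_p_forward = sentence_log_prob(fwd)    # log(1/16)
log_p_backward = sentence_log_prob(bwd)   # log(1/16)

# The bidirectional training objective maximizes this sum over the corpus.
objective = log_p_forward + log_p_backward
```

Summing log probabilities instead of multiplying raw probabilities avoids numerical underflow on long sentences.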
since the bi-directional language model consists of a forward language model and a backward language model, the goal of model optimization is to maximize the sum of forward and backward language model probabilities:
wherein Θ_x and Θ_s denote the parameters of the ELMo word-vector (character convolutional neural network) representation and of the softmax layer, respectively. ELMo is a combination of the BiLSTM network layers: for each character w_k, an L-layer bidirectional language model yields 2L+1 representations, R_k = {x_k^LM, h→_{k,j}^LM, h←_{k,j}^LM | j = 1, …, L}, where x_k^LM is the vector of the character itself, and h→_{k,j}^LM and h←_{k,j}^LM are the outputs of the forward and backward language model LSTMs. To apply ELMo to the Chinese NER task, all layers of R_k are collapsed into a single vector, E_k = E(R_k; Θ_e), so that ELMo provides one vector, combining the 2L+1 representations, for each input character. Second, in the BiLSTM layer, the calculation proceeds as follows:
i_t = σ(W_ii x_t + b_ii + W_hi h_{t-1} + b_hi)
f_t = σ(W_if x_t + b_if + W_hf h_{t-1} + b_hf)
g_t = tanh(W_ig x_t + b_ig + W_hg h_{t-1} + b_hg)
o_t = σ(W_io x_t + b_io + W_ho h_{t-1} + b_ho)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)
where W and b are parameters of the LSTM cell, σ is the sigmoid function, ⊙ is element-wise multiplication, i_t, f_t, o_t denote the input gate, forget gate and output gate at time t, and c_t, h_t, g_t denote the cell state, output state and new candidate state at time t. Given a sentence (w_1 w_2 … w_n) containing n characters, the outputs of the forward and backward LSTMs are concatenated to obtain the feature-vector representation of the character at time t: h_t = [hl_t, hr_t].
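A single step of these gate equations can be sketched in numpy (hypothetical small dimensions and randomly initialized parameters; the dictionary keys mirror the W/b names above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step implementing the gate equations above."""
    i = sigmoid(p['Wii'] @ x_t + p['bii'] + p['Whi'] @ h_prev + p['bhi'])
    f = sigmoid(p['Wif'] @ x_t + p['bif'] + p['Whf'] @ h_prev + p['bhf'])
    g = np.tanh(p['Wig'] @ x_t + p['big'] + p['Whg'] @ h_prev + p['bhg'])
    o = sigmoid(p['Wio'] @ x_t + p['bio'] + p['Who'] @ h_prev + p['bho'])
    c = f * c_prev + i * g          # new cell state c_t
    h = o * np.tanh(c)              # output state h_t
    return h, c

rng = np.random.default_rng(1)
d_in, d_h = 4, 3                    # toy input and hidden sizes
p = {}
for gate in 'ifgo':                 # input, forget, candidate, output
    p['Wi' + gate] = rng.normal(scale=0.1, size=(d_h, d_in))
    p['Wh' + gate] = rng.normal(scale=0.1, size=(d_h, d_h))
    p['bi' + gate] = np.zeros(d_h)
    p['bh' + gate] = np.zeros(d_h)

h, c = lstm_step(rng.normal(size=d_in), np.zeros(d_h), np.zeros(d_h), p)
```

In the BiLSTM layer, this step is run left-to-right and right-to-left over the sentence, and the two hidden states at each position are concatenated to form h_t = [hl_t, hr_t].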
In step 3, specifically in the CRF layer, the outputs o_t are not used to predict each label independently from the features h_t; instead, joint modeling is performed with the CRF. From the character features h_t, the score of each label is computed as o_t = W_o h_t + b_o. Defining a transition matrix A_ij representing the score from label i to label j, the score of the vector sequence (v_1 v_2 … v_n) of the whole sentence is s(X, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} o_{i, y_i}, where y_0 and y_{n+1} are the beginning and ending tags of the sentence. The probability of a tag sequence for the whole sentence is p(y | X) = exp(s(X, y)) / Σ_{y′ ∈ Y_X} exp(s(X, y′)). During training, the log probability of the correct tag sequence is maximized: log p(y | X) = s(X, y) − log Σ_{y′ ∈ Y_X} exp(s(X, y′)), where Y_X represents all possible tag sequences. At decoding time, the most likely tag sequence is output through y* = argmax_{y′ ∈ Y_X} s(X, y′).
In summary, the invention provides a sequence labeling model and, through the sequence labeling method, improves the performance of the sequence labeling model for named entity recognition, thereby improving the accuracy of named entity recognition. The model incorporates ELMo; unlike earlier models in which one word corresponds to one fixed vector, the pre-trained ELMo is not merely a vector lookup table but a full trained model. In use, a sentence or sentence fragment is fed into the model, which infers the word vector of each word from the running text, so that polysemous words can be understood in combination with their preceding and following context.
Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the present invention.
Claims (10)
1. A sequence annotation model, characterized in that the basic framework adopted by the sequence annotation model is the BiLSTM-CRF model, with word vectors of the pre-trained language model ELMo added as additional features, the sequence annotation model comprising:
an input layer: for receiving an input sentence consisting of n characters (w_1 w_2 … w_n), mapping each character in the sentence into a vector sequence by querying a word-vector table, and introducing the ELMo word vector as an additional feature into the input layer;
a BiLSTM network layer: comprising a forward long short-term memory network (LSTM) and a backward LSTM, wherein the forward LSTM computes the representation of the sequence from front to back and the backward LSTM computes the representation of the same sequence from back to front; the BiLSTM network layer receives the vector sequence obtained from the input layer as input so as to obtain the context features of each character;
CRF layer: for receiving the output of the BiLSTM network layer and introducing a transition matrix for global optimal decoding of the vector sequence of the entire sentence.
2. The sequence annotation model of claim 1, wherein: the sentences input into the input layer contain a variable number of characters.
3. The sequence annotation model of claim 1, wherein: the language model ELMo comprises an English ELMo model and a Chinese ELMo model; in the English ELMo model, the representation of an English word combines the word's own vector with a representation obtained by convolving its characters through a convolutional neural network; in the Chinese ELMo model, the representation of each character is encoded directly.
4. The sequence annotation model of claim 1, wherein: the language model ELMo is a bidirectional LSTM language model comprising a forward language model LSTM and a backward language model LSTM; after pre-training the bidirectional LSTM language model, the language model ELMo uses as the word representation the set R_k = {x_k^LM, h→_{k,j}^LM, h←_{k,j}^LM | j = 1, …, L}, where x_k^LM is the vector of the character itself, and h→_{k,j}^LM and h←_{k,j}^LM are the outputs of the forward language model LSTM and the backward language model LSTM respectively, yielding 2L+1 representations.
5. A sequence annotation method is applied to the sequence annotation model of any one of claims 1 to 4, and comprises the following specific steps:
step 1, a characteristic representation stage: the input layer sets (w) the input sentence content through a random word vector or a word vector table initialized with a pre-trained word vector1w2...wn) Character mapping ofIs emitted as a sequence of vectors (v)1v2...vn) While introducing the word vector of ELMo as an additional feature, so that the character representation of each character is a concatenation of its character vector and the word vector representation of ELMo, i.e., wt=[vt,et],t∈[1,n]Then the vector sequence of the whole sentence input is (v)1v2...vn);
Step 2, the encoding stage: for the vector sequence (v_1 v_2 … v_n) output by the feature representation stage, the forward language model LSTM obtains the probability of w_k given (w_1 w_2 … w_{k-1}); the probability of the vector sequence of the whole sentence is then obtained through the probability formula, the objective function is obtained by combining this with the formula of the backward language model LSTM, and the bidirectional LSTM language model then performs the encoding to obtain the feature vector h_t of each character in the whole sentence;
Step 3, the decoding stage: a transition matrix A_ij is defined to represent the score of moving from label i to label j; based on the feature vector h_t of each character, the score of each label is computed as o_t = W_o h_t + b_o; the score of the vector sequence (v_1 v_2 … v_n) of the whole sentence is then computed, and the globally most likely label sequence is selected with reference to the transition matrix A_ij.
9. The sequence annotation method of claim 8, wherein the formulas used by the bidirectional LSTM language model encoding in step 2 include:
i_t = σ(W_ii x_t + b_ii + W_hi h_{t-1} + b_hi)
f_t = σ(W_if x_t + b_if + W_hf h_{t-1} + b_hf)
g_t = tanh(W_ig x_t + b_ig + W_hg h_{t-1} + b_hg)
o_t = σ(W_io x_t + b_io + W_ho h_{t-1} + b_ho)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t
h_t = o_t ⊙ tanh(c_t)
10. The sequence annotation method of claim 9, wherein the score of the vector sequence (v_1 v_2 … v_n) of the whole sentence in step 3 is: s(X, y) = Σ_{i=0}^{n} A_{y_i, y_{i+1}} + Σ_{i=1}^{n} o_{i, y_i};
defining y_0 and y_{n+1} as the beginning and ending tags of the sentence, the probability of a tag sequence for the whole sentence is: p(y | X) = exp(s(X, y)) / Σ_{y′ ∈ Y_X} exp(s(X, y′));
and the maximized log probability of the correct tag sequence is: log p(y | X) = s(X, y) − log Σ_{y′ ∈ Y_X} exp(s(X, y′)).
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011577267.9A CN112651245A (en) | 2020-12-28 | 2020-12-28 | Sequence annotation model and sequence annotation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011577267.9A CN112651245A (en) | 2020-12-28 | 2020-12-28 | Sequence annotation model and sequence annotation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112651245A true CN112651245A (en) | 2021-04-13 |
Family
ID=75363348
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011577267.9A Pending CN112651245A (en) | 2020-12-28 | 2020-12-28 | Sequence annotation model and sequence annotation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112651245A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106569998A (en) * | 2016-10-27 | 2017-04-19 | 浙江大学 | Text named entity recognition method based on Bi-LSTM, CNN and CRF |
CN108460013A (en) * | 2018-01-30 | 2018-08-28 | 大连理工大学 | A kind of sequence labelling model based on fine granularity vocabulary representation model |
CN109117472A (en) * | 2018-11-12 | 2019-01-01 | 新疆大学 | A kind of Uighur name entity recognition method based on deep learning |
CN115114924A (en) * | 2022-06-17 | 2022-09-27 | 珠海格力电器股份有限公司 | Named entity recognition method, device, computing equipment and storage medium |
-
2020
- 2020-12-28 CN CN202011577267.9A patent/CN112651245A/en active Pending
Non-Patent Citations (3)
Title |
---|
MATTHEW E. PETERS et al.: "Deep contextualized word representations", Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics, page 2227 *
ZHANG Dong et al.: "Chinese Named Entity Recognition Based on Context-Dependent Character Vectors", Computer Science, pages 1-12 *
HU Wanting et al.: "An Organization Name Recognition Method Based on an Improved ELMO Model", Computer Technology and Development, pages 25-29 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108460013B (en) | Sequence labeling model and method based on fine-grained word representation model | |
CN108268444B (en) | Chinese word segmentation method based on bidirectional LSTM, CNN and CRF | |
CN109657239B (en) | Chinese named entity recognition method based on attention mechanism and language model learning | |
CN112100351A (en) | Method and equipment for constructing intelligent question-answering system through question generation data set | |
CN110609891A (en) | Visual dialog generation method based on context awareness graph neural network | |
CN109408812A (en) | A method of the sequence labelling joint based on attention mechanism extracts entity relationship | |
CN110737758A (en) | Method and apparatus for generating a model | |
CN112487820B (en) | Chinese medical named entity recognition method | |
CN110555084A (en) | remote supervision relation classification method based on PCNN and multi-layer attention | |
CN114943230B (en) | Method for linking entities in Chinese specific field by fusing common sense knowledge | |
CN112101031B (en) | Entity identification method, terminal equipment and storage medium | |
CN113204611A (en) | Method for establishing reading understanding model, reading understanding method and corresponding device | |
CN112183083A (en) | Abstract automatic generation method and device, electronic equipment and storage medium | |
CN114153971A (en) | Error-containing Chinese text error correction, identification and classification equipment | |
CN114239574A (en) | Miner violation knowledge extraction method based on entity and relationship joint learning | |
CN113743099A (en) | Self-attention mechanism-based term extraction system, method, medium and terminal | |
CN112966073A (en) | Short text matching method based on semantics and shallow features | |
CN113326702A (en) | Semantic recognition method and device, electronic equipment and storage medium | |
CN111145914A (en) | Method and device for determining lung cancer clinical disease library text entity | |
CN114510946A (en) | Chinese named entity recognition method and system based on deep neural network | |
CN114443813A (en) | Intelligent online teaching resource knowledge point concept entity linking method | |
US20210303777A1 (en) | Method and apparatus for fusing position information, and non-transitory computer-readable recording medium | |
CN113641809A (en) | XLNET-BiGRU-CRF-based intelligent question answering method | |
WO2023116572A1 (en) | Word or sentence generation method and related device | |
CN116680407A (en) | Knowledge graph construction method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||