CN111209738A

CN111209738A - Multi-task named entity recognition method combining text classification

Info

Publication number: CN111209738A
Application number: CN201911417834.1A
Authority: CN
Inventors: 庄越挺; 浦世亮; 汤斯亮; 纪睿; 王凯; 吴飞
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2019-12-31
Filing date: 2019-12-31
Publication date: 2020-05-29
Anticipated expiration: 2039-12-31
Also published as: CN111209738B

Abstract

The invention discloses a multi-task named entity recognition method combining text classification. The method comprises the following steps: (1) constructing a text classifier by using a convolutional neural network, and measuring the similarity of texts; (2) selecting a proper threshold, and determining whether the auxiliary task data set participates in the update of the shared layer parameters according to the comparison between the text classification result and the threshold; (3) cascading character vectors of the text and pre-trained word vectors to serve as input feature vectors; (4) in a sharing layer, modeling an input feature vector of each word in a sentence by using a bidirectional LSTM, and learning common features of each task; (5) training each task in turn on the task layer, transmitting the output of the sharing layer to the bidirectional LSTM neural network in the main task private layer or the auxiliary task private layer, then using the linear chain random field to decode the label of the whole sentence, and labeling the entity in the sentence. The invention performs experiments on data sets in multiple biomedical fields, and can effectively improve the named entity recognition effect in a specific field with difficult acquisition of language materials and high labeling cost.

Description

Multi-task named entity recognition method combining text classification

Technical Field

The invention relates to natural language processing, in particular to a multitask named entity recognition method combining text classification.

Background

Natural Language Processing (NLP) is a cross discipline integrating linguistics and computer disciplines. Named Entity Recognition (NER) is a basic task in natural language processing, and aims to recognize proper nouns and meaningful quantitative phrases in natural language texts and classify the proper nouns and meaningful quantitative phrases. With the rise of information extraction and big data concepts, named entity recognition tasks are increasingly emphasized by people and become important components of natural language processing such as public opinion analysis, information retrieval, automatic question answering and machine translation. How to automatically, accurately and quickly identify named entities from massive internet text information gradually becomes a hot problem concerned by academia and industry.

Named entity recognition techniques, which aim to identify entity text and categories in documents in a particular domain (e.g., biomedical), have become an important component of document classification, retrieval, and content analysis in a particular domain. Taking the biomedical field as an example, while the number of biomedical documents, clinical records, etc. is growing at a high rate, there is also a high rate of growth of new biomedical entities and their acronyms, synonyms. However, existing named entity recognition systems based on learning rely heavily on labeling data which requires high cost, and in the biomedical field, professional domain knowledge is required to label data. How to utilize the published data set without additional manual labeling of a new data set has become a research focus at present.

The neural network model is a mainstream entity recognition technology at present for recognizing named entities in texts, however, such a learning model often needs a large amount of labeled data for training. Neural network models often perform very poorly due to the lack of training data in the biomedical field.

Aiming at the difficulty in the prior art, a multi-task named entity recognition method for joint text classification in a specific field is provided. Although data is often limited for a particular domain, there is often some data for related domains. For example, in the biomedical field, there are related field data sets, such as disease data sets, drug data sets, species data sets, and the like. The purpose of the method study is to utilize the data to help the target task improve the effect. The method is based on the assumption that two data sets should overlap in semantic space if they can facilitate each other or the target task. When the overlapped part of the two data sets is close in semanteme, namely a target task is trained, sentences close to the target task in the auxiliary task are trained, and sentences with non-close semantemes are not trained. The used frame is multi-task learning, and if the sentences of the auxiliary task are close to the target task semantics, the sharing layer and the task layer are updated; otherwise, only the task layer is updated. Experiments are carried out on a plurality of data sets in biomedicine and related fields, and the effect of a target task can be effectively improved under most conditions.

Disclosure of Invention

The invention aims to utilize the data sets of related fields to help the target field to improve the effect under the background that new data sets do not need to be additionally labeled, and provides a multi-task named entity recognition method aiming at joint text classification of a specific field.

The technical scheme adopted by the invention is as follows:

a multitask named entity recognition method combining text classification comprises the following steps:

s1: constructing a text classifier by using a convolutional neural network, and measuring the similarity of texts;

s2: selecting a threshold, and determining whether the auxiliary task data set participates in updating of the shared layer parameters according to the comparison between the text classification result and the threshold;

s3: cascading character vectors of the text and pre-trained word vectors to serve as input feature vectors;

s4: in a sharing layer, modeling an input feature vector of each word in a sentence by using a bidirectional LSTM, and learning common features of each task;

s5: training each task in turn on the task layer, transmitting the output of the sharing layer to the bidirectional LSTM neural network in the main task private layer or the auxiliary task private layer, then using the linear chain random field to decode the label of the whole sentence, and labeling the entity in the sentence.

The steps can be realized in the following way:

in step S1, a text classifier is constructed by using a convolutional neural network, and the specific steps of measuring the similarity of the text are as follows:

s11: inputting each word in a sentence, and converting the word into a word vector with a dimension of k through a word embedding module; let the word vector of the ith word in the sentence

If the sentence length is n, the sentence is represented as:

x_1:n＝[x₁；x₂；…；x_n](1)

s12: let the convolution kernel be

At window x_i:i+h-1The upper convolution calculation obtains the characteristic c_i：

c_i＝f(w·x_i:i+h-1+b) (2)

Where h × k is the dimension of the convolution kernel, and b represents the bias;

the sentence-wise constructed features of length n are:

c＝[c₁；c₂；…；c_n-h+1](3)

s13: performing maximum pooling on c

Corresponding feature expressions as convolution kernel w:

s14: using a plurality of convolution kernels w₁,w₂,…，w_sRespectively performing the above operations to express the obtained corresponding characteristics

Splicing, inputting into a fully connected network, and classifying by using a Softmax function; the Softmax function is defined as follows:

wherein, being the input of the Softmax function, V_iAn ith element representing an input vector; s is the output of the Softmax function, S_iAnd an ith element representing an output vector represents the probability that the input sentence belongs to an ith category, and the number of the categories is M.

In step S2, a threshold is selected, and for the data set of the auxiliary task, the specific step of determining whether to participate in updating the shared layer parameter according to the comparison between the text classification result and the threshold is as follows:

s21: setting m data sets, wherein the first data set is set as a main task, and the rest m-1 data sets are auxiliary tasks;

s22: after the training of the text classifier is completed, each sentence is subjected to text classificationThe machine will generate 1 vector as

The 1 st digit of the vector is denoted as k₀Each dataset takes k of all sentences₀As a threshold for the data set;

s23: when the multi-task named entity recognition model is trained, the data of the main task is updated to the sharing layer by default;

s24: the data of the auxiliary task passes through a text classifier, and when the text is classified and output, k₀And if the value is larger than the threshold value, updating the task layer and the sharing layer, otherwise, only updating the task layer.

In step S3, the step of concatenating the character vector of the text and the pre-trained word vector as the input feature vector is as follows:

s31: the method comprises the steps that a natural language processing tool is adopted to perform sentence segmentation and word segmentation on a document, and the sentences, words and labels are counted to form a sentence table, a word table and a label table; counting characters in the word list to form a character list;

s32: let C be the character table, d be the dimension of each character vector, and the character vector matrix be:

s33: let the vector of the ith character of the word t be

The word is denoted t_1:l＝[t₁；t₂；…；t_l]Where l is the length of the word t;

s34: using a kernel of height h

Realizing convolution, adding bias value b, then making nonlinear regression on the whole convolution result to implement characteristic mapping, and mapping function f^tThe ith element f^t(i) Is given by formula (6);

f^t(i)＝tanh(w·t_i:i+h-1+b) (6)

s35: with y^t＝max_if^t(i) A feature expression corresponding to a convolution kernel w as a word t;

s36: using a plurality of convolution kernels w₁，w₂,…,w_qRespectively performing the above operations to express the obtained corresponding characteristics

And splicing the words, and then cascading the words with the word vectors pre-trained by the words t to be used as the input feature vectors of the t.

In step S4, in the sharing layer, the specific steps of modeling the input feature vector of each word in the sentence by using the bidirectional LSTM and learning the common features of each task are as follows:

s41: definition of x_tIs the input feature vector at time t, h_tFor the hidden layer state vector to store all useful information at time t, σ is sigmoid regression layer, and x is inner product, U_i,U_f,U_c,U_oFor input x in different states_tWeight matrix of W_i,W_f,W_c,W_oIs a hidden layer state h_tWeight matrix of b_i,b_f,b_c,b_oIs a bias vector;

s42: the calculation of the forget gate at time t is shown in equation (7):

f_t＝σ(W_fh_t-1+U_fx_t+b_f) (7)

f_tdetermining the proportion of the unit state needing to be forgotten at the time of t-1;

s43: and updating the information in the cell state required to be stored until the time t, wherein the calculation formulas are shown as (8) and (9):

i_t＝σ(W_ih_t-1+U_ix_t+b_i) (8)

wherein

To be added to the candidate vector for the cell state at time t, i_tDetermining

A storable proportion;

s44: combining the calculation results of the first two steps together to generate a new cell state, wherein the calculation formula is shown as an expression (10):

C_tcell state at time t;

s45: the output at time t is calculated, and the calculation formulas are shown as (11) and (12):

o_t＝σ(W_oh_t-1+U_ox_t+b_o) (11)

h_t＝o_t*tanh(C_t) (12)

wherein o is_tDetermining the proportion of the unit state which can be used as output at the time t; h is_tA hidden layer vector representing time t as output information of time t;

s46: hidden layer information h in the above step_tStoring all the past time information, and setting a hidden layer information g by the same method_tFor storing future information, the last two hidden layer information are concatenated to form the final output vector.

In step S5, each task is trained in turn at the task layer, the output of the shared layer is transmitted to the bidirectional LSTM neural network in the main task private layer or the auxiliary task private layer, the entire sentence is tag-decoded by using the linear chain random field, and the entity in the sentence is labeled as follows:

s51: the output of the sharing layer is used as input and is transmitted into a bidirectional LSTM private layer of a main task or an auxiliary task, and then the output of the bidirectional LSTM private layer is used as the input of a conditional random field;

s52: with z ═ z₁,z₂,…,z_nDenotes an input sequence of conditional random fields, where n is the length of the input sequence and z is the length of the input sequence_iIs the input vector of the ith word, y ═ y₁,y₂,…,y_nY (z) ═ y'₁,y′_n,…,y′_nZ is all possible output label sequences;

s53: for tag sequence y, its score is defined as:

wherein A is a transition score matrix, A_j,kRepresents the score of the transition from label j to label k; p is a fractional matrix of the output of the previous layer network, P_j,kA score of a kth tag corresponding to a jth word;

s54: for an input sequence z, the probability that its tag sequence is y is defined as:

in the training process, the logarithmic probability of the correct sequence label is maximized;

s55: at the time of final decoding, the sequence y with the highest score is searched^*As a final output sequence, as shown in equation (15):

。

compared with the prior art, the invention has the beneficial effects that:

1. the invention provides a multi-task named entity recognition method for joint text classification in a specific field. Aiming at the problem that a specific field (such as a biomedical field) is lack of labeled data, the method fully utilizes the theoretical knowledge of multi-task learning and explores and utilizes a related field data set to improve the named entity identification accuracy of the target field.

2. The method combines a text classification model to measure the relevance between the related field data and the target task, the related field data with high relevance to the target task participates in the update of the shared layer parameters, and the data with low relevance only participates in the update of the self task layer parameters. Therefore, irrelevant data are prevented from interfering the training of the target task, and the relevant data are effectively utilized to improve the effect of the target task.

Drawings

FIG. 1 is a schematic diagram of a text classification model based on a convolutional neural network;

FIG. 2 is a schematic diagram of a bi-directional LSTM neural network;

FIG. 3 is a block diagram of a method for multi-tasking named entity recognition for federated text classification;

FIG. 4 is a training flow of a method for multi-task named entity recognition with joint text classification.

Detailed Description

The invention is further described with reference to the following figures and detailed description.

The invention mainly realizes a multi-task named entity recognition method for joint text classification in a specific field. Aiming at the problem that a specific field (such as a biomedical field) is lack of labeled data, the method fully utilizes the theoretical knowledge of multi-task learning and explores and utilizes a related field data set to improve the named entity identification accuracy of the target field. The invention adopts the text classification model based on the convolutional neural network shown in figure 1 to measure the relevance of the related field data and the target task. The result of the character feature vector and the word vector after being cascaded is input into the bidirectional LSTM neural network shown in FIG. 2, and then input into the task layer of the main task or the auxiliary task, and the overall framework of the multitask model is shown in FIG. 3.

The invention discloses a multi-task named entity recognition method based on combined text classification, which comprises the following specific steps:

s1: and constructing a text classifier by using a convolutional neural network, and measuring the similarity of the text.

In this embodiment, the sub-steps of specifically implementing S1 are as follows:

If the sentence length is n, the sentence is represented as:

x_1:n＝[x₁；x₂；…；x_n](1)

s12: let the convolution kernel be

c_i＝f(w·x_i:i+h-1+b) (2)

the sentence-wise constructed features of length n are:

c＝[c₁；c₂；…；c_n-h+1](3)

s13: performing maximum pooling on c

Corresponding feature expressions as convolution kernel w:

s14: using a plurality of convolution kernels w₁,w₂,…,w_sRespectively performing the above operations to express the obtained corresponding characteristics

S2: and selecting a proper threshold, and determining whether the auxiliary task data set participates in the update of the shared layer parameters according to the comparison between the text classification result and the threshold.

In this embodiment, the sub-steps of specifically implementing S3 are as follows:

s22: after the training of the text classifier is completed, each sentence generates 1 vector through the text classifier and records the vector as

S3: and cascading character vectors of the text and pre-trained word vectors to serve as input feature vectors.

s32: let C be the character table, d be the dimension of each character vector, the character vector matrixComprises the following steps:

s33: let the vector of the ith character of the word t be

s34: using a kernel of height h

f^t(i)＝tanh(w·t_i:i+h-1+b) (6)

s36: using a plurality of convolution kernels w₁,w₂,…,w_qRespectively performing the above operations to express the obtained corresponding characteristics

S4: in the sharing layer, the input feature vector of each word in the sentence is modeled by using bidirectional LSTM, and the common features of all tasks are learned.

In this embodiment, the sub-steps of specifically implementing S4 are as follows:

s42: the calculation of the forget gate at time t is shown in equation (7):

f_t＝σ(W_fh_t-1+U_fx_t+b_f) (7)

i_t＝σ(W_ih_t-1+U_ix_t+b_i) (8)

wherein

A storable proportion;

C_tcell state at time t;

o_t＝σ(W_oh_t-1+U_ox_t+b_o) (11)

h_t＝o_t*tanh(C_t) (12)

In this embodiment, the sub-steps of specifically implementing S5 are as follows:

s53: for tag sequence y, its score is defined as:

。

the method is applied to the embodiment, the specific steps and the parameter definitions are as described above, and some contents are not repeated again, and the embodiment mainly shows the specific implementation and technical effects thereof.

Examples

Taking 3 public data sets (BioNLP13CG, BioNLP13PC and CRAFT) of cell component groups in the biomedical field as an example, the method is applied to the 3 data sets for named entity identification, and specific parameters and practices in each step are as follows: training a text classifier:

1. each word in the input sentence is converted into a word vector of dimension 128 by the word embedding module. A sentence of length n may be represented as_1:n＝[x₁；x₂；…；x_n]；

2. The convolution kernel uses three sizes of 3, 4 and 5, and 100 sentences with the length of n are respectively used for constructing the feature which is recorded as c;

3. maximum in pooled level selection features

4. And splicing all the features, inputting the spliced features into a fully-connected network, and classifying by using a Softmax function so as to construct a text classifier. When the text classifier is trained, the batch size is 64, the dropout is 0.5, and the initial learning rate is set to be 0.001;

selecting a proper threshold value:

5.for example, the named entity recognition task of BioNLP13CG is used as a main task, and the other two tasks are used as auxiliary tasks; after the training of the text classifier is completed, each sentence of the data sets BioNLP13PC and CRAFT generates 1 vector through the text classifier and records the vector as 1 vector

The 1 st digit of the k vector is denoted as k₀. The two data sets respectively take k of all sentences₀The average value of (a) is used as a threshold value;

6. during multitask model training, the data of the BioNLP13CG updates the shared layer by default. The data of BioNLP13PC and CRAFT are firstly processed by a text classifier, and when the text classification outputs k₀When the value is larger than the corresponding threshold value, the task layer and the sharing layer are updated; otherwise, only updating the task layer;

extracting character feature vectors of the text, and cascading the character feature vectors and the pre-trained word vectors as input feature vectors:

7. and performing sentence segmentation and word segmentation on the document by adopting a natural language processing tool, and performing statistics on sentences, words and labels to form a sentence table, a vocabulary table and a label table. Counting characters in the word list to form a character list;

8. let C be the character table, d be the dimension of each character vector, and the character vector matrix be:

9. let the vector of the ith character of the word t be

10. using a kernel of height h

Realizing convolution, adding bias value b, then making nonlinear regression on the whole convolution result to implement characteristic mapping, and mapping function f^tThe ith element f^t(i) Is given by the formula (6). With y^t＝max_if^t(i) As a characteristic expression for the word t corresponding to the convolution kernel w.

11. Using a plurality of convolution kernels w₁,w₂,…,w_qRespectively performing the above operations to express the obtained corresponding characteristics

Spliced together, and then cascaded with the word t pre-trained GloVe 100-dimensional word vector disclosed by 6 hundred million Stanford as the input feature vector of t.

At the sharing layer, the input feature vector for each word in the sentence is modeled using bi-directional LSTM:

12. in the sharing layer, the input feature vector obtained in the step 11 is transmitted into a bidirectional LSTM, the parameter updating mode of the bidirectional LSTM neural network is that 10 is used as batchsize, parameter updating is carried out by using an Adam optimization algorithm, dropout is 0.5, the initialized learning rate is 0.015, and after each iteration, the learning rate updating formula is that

Wherein the decline rate d is 0.05, and e is the iteration number;

13. definition of x_tIs the input feature vector at time t, h_tFor the hidden layer state vector to store all useful information at time t, σ is sigmoid regression layer, and x is inner product, U_i,U_f,U_c,U_oFor input x in different states_tWeight matrix of W_i,W_f,W_c,W_oIs a hidden layer state h_tWeight matrix of b_i,b_f,b_c,b_oIs a bias vector;

14. the calculation formula for forget gate at time t is as follows:

f_t＝σ(W_fh_t-1+U_fx_t+b_f)

15. and updating the information in the cell state required to be saved to the time t, wherein the calculation formula is as follows:

i_t＝σ(W_ih_t-1+U_ix_t+b_i)

wherein

A vector that can be added to the cell state at time t;

16. combining the calculation results of the first two steps together to generate a new cell state, wherein the calculation formula is as follows:

wherein

A vector of a cell state at the time t;

17. output at time t, and update h_tThe calculation formula is as follows:

O_t＝σ(W_oh_t-1+U_ox_t+b_o)

h_t＝o_t*tanh(C_t)

wherein o is_tIs the output at time t; h is_tA vector of a hidden layer at time t;

18. h in the above step_tStoring all the past time information, and setting a g again by the same method_tFor storing future information, the last two hidden layer information are concatenated to form the final output vector.

Training each task in turn at the task level:

19. the output of BioNLP13CG at the shared layer is used as input to be transmitted into the bidirectional LSTM network of the private layer of the main task, and the output of BioNLP13PC and CRAFT at the shared layer is used as input to be transmitted into the bidirectional LSTM networks of the private layers of the auxiliary task 1 and the auxiliary task 2 respectively. Taking the output of the bidirectional LSTM as the input of the conditional random field;

and (3) carrying out entity labeling on each word by using a conditional random field:

20. with z ═ z₁,z₂,…,z_nDenotes an input sequence of conditional random fields, where n is the length of the input sequence and z is the length of the input sequence_iIs the input vector of the ith word, y ═ y₁,y₂,…,y_nY (z) ═ y'₁,y′_n,…,y′_nA possible output tag sequence of z;

21. for tag sequence y, its score is defined as:

wherein A is a transition score matrix, A_j,kRepresents the score of the transition from label j to label k; p is a fractional matrix of the output of the previous layer network, P_j,kThe score of the kth tag corresponding to the jth word.

22. For an input sequence z, the probability that its tag sequence is y is defined as:

in the training process, we maximize the log probability of the correct sequence label;

23. at the time of final decoding, the sequence y with the highest score is searched^*As the final output sequence:

24. and identifying the position of the marked words in the original file, and neatly feeding back the marking result to the user, so that the marking accuracy can be calculated. The following table was achieved:

data set	Single task	BioNLP13CG	BioNLP13PC	CRAFT
					BioNLP13CG	74.72	77.11	77.65	69.16
BioNLP13PC	88.17	78.16	89.12	77.23
					CRAFT	64.24	61.53	62.31	64.72

The single task column indicates the accuracy of 3 data sets to identify tasks as separate named entities. The accuracy of the BioNLP13CG column indicates the accuracy of the BioNLP13CG as the main task and the remaining 2 as the auxiliary tasks, and the BioNLP13PC column and the CRAFT column are the same.

From the above experimental results, the accuracy of the main task in the multitask model is generally higher than that of the single task. Therefore, the method can effectively improve the accuracy of the target task.

Claims

1. A multitask named entity recognition method combining text classification is characterized by comprising the following steps:

2. The method for multi-task named entity recognition through combined text classification according to claim 1, wherein in step S1, a text classifier is constructed by using a convolutional neural network, and the specific steps for measuring the similarity of texts are as follows:

If the sentence length is n, the sentence is represented as:

x_1:n＝[x₁；x₂；…；x_n](1)

s12: let the convolution kernel be

c_i＝f(w·x_i:i+h-1+b) (2)

the sentence-wise constructed features of length n are:

c＝[c₁；c₂；…；c_n-h+1](3)

s13: performing maximum pooling on c

Corresponding feature expressions as convolution kernel w:

where V is the input to the Softmax function, V_iAn ith element representing an input vector; s is the output of the Softmax function, S_iAnd an ith element representing an output vector represents the probability that the input sentence belongs to an ith category, and the number of the categories is M.

3. The method as claimed in claim 1, wherein the step S2 of selecting the threshold, and the specific steps of determining whether the auxiliary task data set participates in the update of the shared layer parameter according to the comparison between the text classification result and the threshold are as follows:

4. The method for multi-task named entity recognition through combined text classification as claimed in claim 1, wherein in step S3, the step of concatenating the character vector of the text and the pre-trained word vector as the input feature vector comprises:

s33: let the vector of the ith character of the word t be

s34: using a kernel of height h

f^t(i)＝tanh(w·t_i:i+h-1+b) (6)

5. The method as claimed in claim 1, wherein in step S4, the step of learning the common features of each task by modeling the input feature vector of each word in the sentence with bidirectional LSTM in the sharing layer comprises the following steps:

s42: the calculation of the forget gate at time t is shown in equation (7):

f_t＝σ(W_fh_t-1+U_fx_t+b_f) (7)

i_t＝σ(W_ih_t-1+U_ix_t+b_i) (8)

wherein

A storable proportion;

C_tcell state at time t;

o_t＝σ(W_oh_t-1+U_ox_t+b_o) (11)

h_t＝o_t*tanh(C_t) (12)

wherein o is_tDetermining the proportion of the unit state which can be used as output at the time t; h is_tHidden layer vector representing time t, asOutput information at time t;

6. The method as claimed in claim 1, wherein in step S5, the steps of training each task in turn at task layer, transmitting the output of the shared layer to the bi-directional LSTM neural network in the main task private layer or the auxiliary task private layer, then using linear chain random field to decode the label of the whole sentence, and labeling the entity in the sentence are as follows:

s53: for tag sequence y, its score is defined as: