CN113360667B - Biomedical trigger word detection and named entity identification method based on multi-task learning - Google Patents
- Publication number: CN113360667B
- Application number: CN202110617440.1A
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/367 — Information retrieval of unstructured textual data; creation of semantic tools; Ontology
- G06F40/289 — Natural language analysis; phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295 — Named entity recognition
- G06N3/044 — Neural networks; recurrent networks, e.g. Hopfield networks
Abstract
The invention discloses a biomedical trigger word detection and named entity recognition method based on multi-task learning, which comprises the following steps: 1, preprocessing unstructured biomedical text with word segmentation and sentence segmentation techniques, and labeling the preprocessed text to generate a standard data set; 2, constructing a multi-task-learning-based neural network model for biomedical trigger word detection and named entity recognition; 3, training the neural network model and updating its parameters; and 4, predicting on unlabeled data with the trained optimal model, so as to identify the trigger words and named entities it contains. The method performs trigger word detection and named entity recognition on biomedical text simultaneously, thereby effectively improving recognition accuracy and reducing the demand on computing resources.
Description
Technical Field
The invention relates to the field of biomedical text mining, in particular to a biomedical trigger word detection and named entity identification method based on multi-task learning.
Background
A named entity is a specific noun or noun phrase in text that carries particularly important meaning. Named entity recognition can be divided into recognition of entities in the general domain and in specific domains. In the general domain, entities include organization names, person names, place names, and the like. In the biomedical domain, entities can be classified into cell entities, gene entities, protein entities, drug entities, disease entities, and the like. Compared with named entity recognition in the general domain, recognition in the biomedical domain is more difficult owing to entity nesting, word ambiguity, and similar phenomena. Accurate identification of biomedical entities facilitates the further development of information extraction and natural language processing techniques. In the biomedical field, named entity recognition can extract structured entity information from large amounts of unstructured literature, which promotes the construction of biomedical knowledge graphs and databases.
Current popular named entity recognition methods can mainly be divided into rule-based methods, traditional machine-learning methods, and deep-learning methods. Rule-based approaches rely primarily on manually formulated rules, including domain-specific gazetteers, syntactic-lexical patterns, and the like, to identify entities in text, and require no annotated data set. Traditional machine-learning methods rely mainly on manually designed linguistic features, such as prefix and suffix features, lexical features, and syntactic features, to train a traditional machine-learning algorithm to recognize named entities. In recent years, owing to the advantage of deep neural networks in automatically extracting the internal features of data, many deep-learning-based named entity recognition methods have emerged. However, most current methods can only perform a single entity recognition task in isolation and do not sufficiently extract the semantic features in the text, so their recognition performance remains limited.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art and provides a biomedical trigger word detection and named entity recognition method based on multi-task learning, so that trigger words and named entities in biomedical text can be detected and recognized simultaneously, effectively improving recognition accuracy while reducing the demand on computing resources.
In order to achieve the purpose, the invention adopts the technical scheme that:
The invention relates to a biomedical trigger word detection and named entity recognition method based on multi-task learning, which is characterized by comprising the following steps:
step 1, preprocessing unstructured biomedical texts:
performing word segmentation and sentence segmentation on the unstructured biomedical text to obtain a label-free training data set consisting of n sentence sequences, recorded as S = {S_1, S_2, …, S_i, …, S_n}; wherein S_i = {w_i^1, w_i^2, …, w_i^j, …, w_i^m} represents the ith sentence sequence, w_i^j = {c_1, c_2, …, c_k, …, c_K} represents the jth word sequence in the ith sentence sequence, and c_k represents the kth character of the jth word sequence w_i^j of the ith sentence sequence S_i; n represents the total number of sentences in the training data set, m represents the total number of words in one sentence, and K represents the total number of characters in one word;
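A minimal sketch of the preprocessing in step 1, assuming a simple regex tokenizer stands in for whatever biomedical tokenizer the patent's implementation actually uses:

```python
import re

def preprocess(text):
    """Split raw text into sentence sequences, each a list of word tokens."""
    # Naive sentence segmentation on ., !, ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Naive word segmentation: word characters or single punctuation marks.
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences if s]

# S = {S_1, ..., S_n}: the label-free training data set of step 1.
S = preprocess("IL-2 activates T cells. STAT5 binds DNA.")
print(S)
```

The output is a list of n sentence sequences S_i, each a list of m word sequences, ready for labeling in step 2.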
step 2, labeling the training data set S:
step 2.1, setting the trigger word categories and the named entity categories as L_td = {L_td^1, L_td^2, …} and L_ner = {L_ner^1, L_ner^2, …}; wherein L_td^n denotes the nth trigger word category and L_ner^n denotes the nth entity category;
step 2.2, labels of the trigger word category and the named entity category are simultaneously added to all word sequences of the ith sentence sequence S_i in the training data set S, so as to obtain a labeled training data set S_td of the trigger word detection task and a labeled training data set S_ner of the named entity recognition task; wherein (w_i^j, y_td^{i,j}) in S_td represents the jth word sequence w_i^j of the ith sentence sequence S_i together with its corresponding trigger word category y_td^{i,j}, and (w_i^j, y_ner^{i,j}) in S_ner represents the jth word sequence w_i^j of the ith sentence sequence S_i together with its corresponding entity category y_ner^{i,j};
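As an illustration of step 2.2, one sentence with two parallel label sequences; the BIO tagging scheme and the category names used here are common conventions assumed for the example, not fixed by the patent:

```python
# One sentence from S_i, labeled simultaneously for both tasks.
words      = ["IL-2", "activates", "T", "cells", "."]
td_labels  = ["O", "B-Positive_regulation", "O", "O", "O"]   # trigger word task
ner_labels = ["B-Protein", "O", "B-Cell", "I-Cell", "O"]     # entity task

# The labeled data sets S_td and S_ner pair each word with its category.
S_td  = list(zip(words, td_labels))
S_ner = list(zip(words, ner_labels))
print(S_td[1], S_ner[0])
```

Both label sequences are aligned word-for-word with the sentence, which is what allows the two tasks to share the encoding layers later.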
Step 3, word vector pre-training:
obtaining biomedical documents and performing word segmentation and sentence segmentation to obtain a label-free training data set consisting of n′ sentence sequences, recorded as S′ = {S′_1, S′_2, …, S′_{i′}, …, S′_{n′}}; S′_{i′} represents the i′th sentence sequence; the label-free training data set S′ is then trained with the Word2Vec tool based on a language model to obtain a pre-trained word vector matrix M;
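Word2Vec training itself is typically delegated to a library such as gensim; the sketch below (pure NumPy, with randomly initialized rows standing in for the vectors Word2Vec would learn) only shows the vocabulary-to-matrix lookup that the later layers rely on:

```python
import numpy as np

V = 8  # word vector dimension (the patent allows an arbitrary dimension V)
corpus = [["STAT5", "binds", "DNA"], ["IL-2", "activates", "STAT5"]]  # stands in for S'

# Build a vocabulary over the unlabeled corpus S'.
vocab = {w: i for i, w in enumerate(sorted({w for s in corpus for w in s}))}

# Pre-trained word vector matrix M: one row per vocabulary word.
# Random values stand in for vectors an actual Word2Vec run would produce.
rng = np.random.default_rng(0)
M = rng.normal(size=(len(vocab), V))

def word_vector(word):
    """Look up the word-level vector x for a word via its vocabulary index."""
    return M[vocab[word]]

x = word_vector("STAT5")
print(x.shape)
```

In step 4.1, converting a sentence to word-level vectors is exactly this lookup applied to every word.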
step 4, data processing of the MTL-TD-NER model, the multi-task-learning-based biomedical trigger word detection and named entity recognition neural network; the MTL-TD-NER model consists of a word vector coding layer based on hybrid coding embedding, a feature extraction layer based on bidirectional LSTM, and a classification layer based on conditional random fields;
step 4.1, the ith sentence sequence S_i in text form is converted, by means of the pre-trained word vector matrix M, into word-level word vectors of arbitrary dimension V, X_i = {x_i^1, x_i^2, …, x_i^j, …, x_i^m}; wherein x_i^j represents the word-level word vector of the jth word w_i^j;
step 4.2, data processing of the word vector coding layer based on hybrid coding embedding, wherein the word vector coding layer is composed of bidirectional long short-term memory network units;
step 4.2.1, each character of the jth word sequence w_i^j is input into a character-level bidirectional long short-term memory network unit, which is trained on the probability of all the characters generating the corresponding word;
step 4.2.2, the hidden-layer outputs of the first character and the last character of the jth word sequence w_i^j in the bidirectional long short-term memory network unit are extracted and concatenated as the character-level vector r_i^j of the jth word sequence w_i^j;
step 4.3, the word-level word vector x_i^j and the character-level word vector r_i^j of the jth word w_i^j are concatenated to obtain the hybrid-encoded word vector e_i^j = [x_i^j; r_i^j] of the jth word sequence w_i^j, thereby obtaining the word vectors E_i = {e_i^1, e_i^2, …, e_i^m} of the ith sentence sequence S_i;
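A dimensional sketch of steps 4.2–4.3, with a toy recurrent update standing in for a real LSTM cell (an assumption made only to keep the example self-contained; the point is the shapes: first/last hidden states concatenated into r_i^j, then appended to the word-level vector x_i^j):

```python
import numpy as np

V, C = 8, 4          # word-level dim V, character-level hidden dim C
rng = np.random.default_rng(1)

def char_level_vector(word):
    """Toy bidirectional pass over characters; a real model uses LSTM cells."""
    chars = rng.normal(size=(len(word), C))       # stand-in char embeddings
    fwd = np.zeros(C)
    for c in chars:                               # forward recurrence
        fwd = np.tanh(fwd + c)
    bwd = np.zeros(C)
    for c in chars[::-1]:                         # backward recurrence
        bwd = np.tanh(bwd + c)
    # Final hidden state of each direction, concatenated: r_i^j, shape (2C,).
    return np.concatenate([fwd, bwd])

x = rng.normal(size=V)              # word-level vector x_i^j (from matrix M)
r = char_level_vector("STAT5")      # character-level vector r_i^j
e = np.concatenate([x, r])          # hybrid-encoded word vector e_i^j
print(e.shape)
```

The hybrid vector has dimension V + 2C, combining corpus-level semantics with sub-word morphology.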
step 4.4, data processing of the feature extraction layer based on bidirectional LSTM:
the word vectors E_i = {e_i^1, …, e_i^m} of the ith sentence sequence S_i are input into a forward LSTM network and into a reverse LSTM network; finally, for the jth word vector e_i^j, the hidden-layer states →h_i^j and ←h_i^j output by the two LSTM networks are concatenated as the context feature information h_i^j of the word vector at the jth position, thereby obtaining the feature sequence H_i = {h_i^1, h_i^2, …, h_i^m} of the ith sentence sequence S_i;
step 4.5, data processing of the classification layer based on conditional random fields:
step 4.5.1, a classification layer for the trigger word detection task and a classification layer for the named entity recognition task are constructed, each taking a conditional random field as its basic unit; a parallel information conversion layer transform_td is constructed for the classification layer of the trigger word detection task, and a parallel information conversion layer transform_ner is constructed for the classification layer of the named entity recognition task;
step 4.5.2, the feature sequence H_i is input into the classification layer of the trigger word detection task to obtain the output O_td; O_td is then input into the information conversion layer transform_td to obtain the trigger word feature information F_td; finally, the feature information F_td is added to the feature sequence H_i to obtain the overall entity features, which are input into the classification layer of the named entity recognition task to obtain the final entity recognition result;
step 4.5.3, the feature sequence H_i is input into the classification layer of the named entity recognition task to obtain the output O_ner; O_ner is then input into the information conversion layer transform_ner to obtain the entity feature information F_ner; finally, the feature information F_ner is added to the feature sequence H_i to obtain the overall trigger word features, which are input into the classification layer of the trigger word detection task to obtain the final trigger word recognition result;
Step 5, training an MTL-TD-NER model to obtain an optimal trigger word detection and named entity recognition model:
step 5.1, setting parameter variables of the model:
the batch size is B; the current iteration number is epoch_now; the maximum number of iterations is epoch_max; the number of consecutive iterations in which the loss of the model does not decrease is epoch_no; the maximum number of such iterations for the early-stopping strategy is epoch_es;
Step 5.2, parameter initialization:
initializing each parameter of a word vector layer based on mixed coding embedding, a feature extraction layer based on bidirectional LSTM and a classification layer based on conditional random field by adopting a uniform distribution method;
step 5.3, starting from epoch_now, batches of size B from the training data set S are input into the MTL-TD-NER model each time, and the loss between the output labels of the model and the correct labels in the training data set S is calculated with formula (1), so as to update the parameters of the model:

loss = λ·loss_td + μ·loss_ner   (1)

in formula (1), loss_td and loss_ner are the loss functions of the trigger word detection and named entity recognition tasks:

loss_td = −score(y_td) + log Σ_{ỹ_td ∈ Y_td} exp(score(ỹ_td))   (2)

loss_ner = −score(y_ner) + log Σ_{ỹ_ner ∈ Y_ner} exp(score(ỹ_ner))   (3)

in formulas (2) and (3), y_td and y_ner are the trigger word label sequence and the named entity label sequence; score(y_td) and score(y_ner) are respectively the scores of the trigger word label sequence and of the named entity label sequence output by the MTL-TD-NER model for the input ith sentence sequence S_i; λ and μ are hyper-parameters used to balance the importance of the two tasks; Y_td represents the set of all possible trigger word label sequences, Y_ner represents the set of all possible entity label sequences, ỹ_td represents one trigger word label sequence in Y_td, and ỹ_ner represents one entity label sequence in Y_ner;
step 5.4, if epoch_now is less than epoch_max and epoch_no is less than epoch_es, epoch_now is incremented by 1 and step 5.3 is executed again; if epoch_now is greater than or equal to epoch_max, or epoch_no equals epoch_es, the optimal multi-task-learning-based trigger word detection and named entity recognition network model is obtained;
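The stopping logic of steps 5.3–5.4 can be sketched as a plain loop; the per-epoch losses below are dummy values chosen so that the plateau triggers the early-stopping branch before epoch_max is reached:

```python
epoch_now, epoch_max = 0, 100     # current and maximum iteration counts
epoch_no, epoch_es = 0, 15        # no-improvement counter and its limit
best_loss = float("inf")
# Dummy losses standing in for real training: four improvements, then a plateau.
losses = [3.0, 2.5, 2.1, 2.0] + [2.0] * 20

while epoch_now < epoch_max and epoch_no < epoch_es:
    loss = losses[min(epoch_now, len(losses) - 1)]
    if loss < best_loss:
        best_loss, epoch_no = loss, 0   # improvement: reset the counter
    else:
        epoch_no += 1                   # no improvement for one more epoch
    epoch_now += 1

print(epoch_now, epoch_no, best_loss)
```

With these values the loop stops after 19 epochs (4 improving epochs plus 15 flat ones), well before epoch_max, illustrating how epoch_es bounds wasted training.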
and 6, identifying the unlabeled data by using the optimal trigger word detection and named entity identification network model so as to obtain a trigger word label and a named entity identification label of each word in the unlabeled data.
Compared with the prior art, the invention has the beneficial effects that:
1. the method is different from the traditional named entity identification method based on rules and machine learning, realizes an end-to-end neural network model, avoids the manual design of various rules such as lexical and syntactic rules and the manual extraction of linguistic features, and simplifies the implementation of trigger word detection and named entity identification.
2. The invention designs a neural network model to simultaneously process the trigger word detection task and the named entity recognition task, and adopts a hard parameter sharing mode to enable the two tasks to share the same word vector coding layer based on mixed coding embedding and the feature extraction layer based on bidirectional LSTM, thereby accelerating the training process of the model and improving the operation efficiency of the model.
3. The invention utilizes an information conversion layer to convert mutually beneficial information between the trigger word and the named entity, can better mine useful characteristic information, and respectively inputs the useful characteristic information into the classification layers to help each other to better identify the trigger word and the named entity.
4. According to the method, the trigger detection task and the named entity recognition task are trained simultaneously under the multi-task learning framework, so that data enhancement can be performed implicitly, regularization is introduced, the risk of over-fitting is effectively avoided, and the recognition accuracy is improved.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
In this embodiment, a biomedical trigger word detection and named entity recognition method based on multi-task learning mainly uses a word vector coding layer based on hybrid embedding and a feature extraction layer based on bidirectional LSTM as a common part of two tasks, and then two classification layers based on conditional random fields are respectively constructed to simultaneously perform trigger word detection and named entity recognition, specifically as shown in fig. 1, the method is performed according to the following steps:
step 1, preprocessing unstructured biomedical texts:
performing word segmentation and sentence segmentation on the unstructured biomedical text to obtain a label-free training data set consisting of n sentence sequences, recorded as S = {S_1, S_2, …, S_i, …, S_n}; wherein S_i = {w_i^1, w_i^2, …, w_i^j, …, w_i^m} represents the ith sentence sequence, w_i^j = {c_1, c_2, …, c_k, …, c_K} represents the jth word sequence in the ith sentence sequence, and c_k represents the kth character of the jth word sequence w_i^j of the ith sentence sequence S_i; n represents the total number of sentences in the training data set, m represents the total number of words in one sentence, and K represents the total number of characters in one word;
step 2, labeling the training data set S:
step 2.1, setting the trigger word categories and the named entity categories respectively as L_td = {L_td^1, L_td^2, …} and L_ner = {L_ner^1, L_ner^2, …}; wherein L_td^n denotes the nth trigger word category and L_ner^n denotes the nth entity category;
step 2.2, labels of the trigger word category and the named entity category are simultaneously added to all word sequences of the ith sentence sequence S_i in the training data set S, so as to obtain a labeled training data set S_td of the trigger word detection task and a labeled training data set S_ner of the named entity recognition task; wherein (w_i^j, y_td^{i,j}) in S_td represents the jth word sequence w_i^j of the ith sentence sequence S_i together with its corresponding trigger word category y_td^{i,j}, and (w_i^j, y_ner^{i,j}) in S_ner represents the jth word sequence w_i^j of the ith sentence sequence S_i together with its corresponding entity category y_ner^{i,j};
Step 3, word vector pre-training:
in order to make the word-level word vectors contain a large amount of linguistic information, a large number of biomedical documents are downloaded from the PubMed database, and word segmentation and sentence segmentation are performed to obtain a label-free training data set consisting of n′ sentence sequences, recorded as S′ = {S′_1, S′_2, …, S′_{i′}, …, S′_{n′}}; S′_{i′} represents the i′th sentence sequence; S′ is then trained with the Word2Vec tool based on a language model to obtain a pre-trained word vector matrix M;
step 4, data processing of the MTL-TD-NER model, the multi-task-learning-based biomedical trigger word detection and named entity recognition neural network; the MTL-TD-NER model consists of a word vector coding layer based on hybrid coding embedding, a feature extraction layer based on bidirectional LSTM, and a classification layer based on conditional random fields;
step 4.1, the ith sentence sequence S_i in text form is converted, by means of the pre-trained word vector matrix M, into word-level word vectors of arbitrary dimension V, X_i = {x_i^1, x_i^2, …, x_i^j, …, x_i^m}; wherein x_i^j represents the word-level word vector of the jth word w_i^j;
Step 4.2, processing data based on a word vector coding layer embedded by mixed coding, wherein the word vector coding layer is composed of bidirectional long-term and short-term memory network units;
step 4.2.1, in order to obtain character level characteristic information of words, the jth word sequence is processedEach character in the character string is input into a bidirectional long and short term memory network unit at the character level and used for training the generation probability of generating corresponding words by all the characters;
step 4.2.2, extracting the jth word sequenceThe first character and the last character in the bidirectional long-short term memory network unit are connected as the jth word sequence after being output on the hidden layer in the bidirectional long-short term memory network unitCharacter level vector of
step 4.3, the word-level word vector x_i^j and the character-level word vector r_i^j of the jth word w_i^j are concatenated to obtain the hybrid-encoded word vector e_i^j = [x_i^j; r_i^j] of the jth word sequence w_i^j, thereby obtaining the word vectors E_i = {e_i^1, e_i^2, …, e_i^m} of the ith sentence sequence S_i;
step 4.4, data processing of the feature extraction layer based on bidirectional LSTM:
in order to obtain the feature information of the whole context, the word vectors E_i = {e_i^1, …, e_i^m} of the ith sentence sequence S_i are input into a forward LSTM network and into a reverse LSTM network; finally, for the jth word vector e_i^j, the hidden-layer states →h_i^j and ←h_i^j output by the two LSTM networks are concatenated as the context feature information h_i^j of the word vector at the jth position, thereby obtaining the feature sequence H_i = {h_i^1, h_i^2, …, h_i^m} of the ith sentence sequence S_i;
step 4.5, data processing of the classification layer based on conditional random fields:
step 4.5.1, a classification layer for the trigger word detection task and a classification layer for the named entity recognition task are constructed, each taking a conditional random field as its basic unit, so that the classification layer can handle label dependence well; meanwhile, considering that trigger words and entities are correlated and mutually reinforcing, a parallel information conversion layer transform_td is constructed for the classification layer of the trigger word detection task, and a parallel information conversion layer transform_ner is constructed for the classification layer of the named entity recognition task;
step 4.5.2, the feature sequence H_i is input into the classification layer of the trigger word detection task to obtain the output O_td; O_td is then input into the information conversion layer transform_td to obtain the trigger word feature information F_td; finally, the feature information F_td is added to the feature sequence H_i to obtain the overall entity features, which are input into the classification layer of the named entity recognition task to obtain the final entity recognition result;
step 4.5.3, the feature sequence H_i is input into the classification layer of the named entity recognition task to obtain the output O_ner; O_ner is then input into the information conversion layer transform_ner to obtain the entity feature information F_ner; finally, the feature information F_ner is added to the feature sequence H_i to obtain the overall trigger word features, which are input into the classification layer of the trigger word detection task to obtain the final trigger word recognition result;
Step 5, training an MTL-TD-NER model so as to obtain an optimal trigger word detection and named entity recognition model:
step 5.1, setting parameter variables of the model:
setting the batch size B to 50; the current iteration number epoch_now starts at 0; the maximum number of iterations epoch_max is 100; the counter epoch_no of consecutive iterations in which the model loss does not decrease starts at 0; and the maximum number of such iterations for the early-stopping strategy, epoch_es, is 15;
step 5.2, parameter initialization:
initializing each parameter of a word vector layer based on mixed coding embedding, a feature extraction layer based on bidirectional LSTM and a classification layer based on conditional random field by adopting a uniform distribution method;
step 5.3, starting from epoch_now, batches of size B from the training data set S are input into the MTL-TD-NER model each time, and the loss between the output labels of the model and the correct labels in the training data set S is calculated with formula (1), so as to update the parameters of the model:

loss = λ·loss_td + μ·loss_ner   (1)

in formula (1), loss_td and loss_ner are the loss functions of the trigger word detection and named entity recognition tasks:

loss_td = −score(y_td) + log Σ_{ỹ_td ∈ Y_td} exp(score(ỹ_td))   (2)

loss_ner = −score(y_ner) + log Σ_{ỹ_ner ∈ Y_ner} exp(score(ỹ_ner))   (3)

in formulas (2) and (3), y_td and y_ner are the trigger word label sequence and the named entity label sequence; score(y_td) and score(y_ner) are respectively the scores of the trigger word label sequence and of the named entity label sequence output by the MTL-TD-NER model for the input ith sentence sequence S_i; λ and μ are hyper-parameters used to balance the importance of the two tasks, both set to 1 here; Y_td represents the set of all possible trigger word label sequences, Y_ner represents the set of all possible entity label sequences, ỹ_td represents one trigger word label sequence in Y_td, and ỹ_ner represents one entity label sequence in Y_ner;
step 5.4, if epoch_now is less than epoch_max and epoch_no is less than epoch_es, epoch_now is incremented by 1 and step 5.3 is executed again; if epoch_now is greater than or equal to epoch_max, or epoch_no equals epoch_es, the optimal multi-task-learning-based trigger word detection and named entity recognition network model is obtained;
and 6, identifying the unlabeled data by using the optimal trigger word detection and named entity identification network model so as to obtain the trigger word label and the named entity identification label of each word in the unlabeled data.
The above embodiment provides a biomedical trigger word detection and named entity recognition method based on multi-task learning. Instead of using two different models to perform trigger word detection and named entity recognition separately, a multi-task learning framework is designed to perform the two tasks simultaneously. The MTL-TD-NER model was tested on a data set to verify the effectiveness of the multi-task learning framework, and the multi-task framework shows clear advantages in both trigger word detection and named entity recognition.
Claims (1)
1. A biomedical trigger word detection and named entity recognition method based on multitask learning is characterized by comprising the following steps:
step 1, preprocessing an unstructured biomedical text:
performing word segmentation and sentence segmentation on the unstructured biomedical text to obtain a label-free training data set consisting of n sentence sequences, recorded as S = {S_1, S_2, …, S_i, …, S_n}; wherein S_i = {w_i^1, w_i^2, …, w_i^j, …, w_i^m} represents the ith sentence sequence, w_i^j = {c_1, c_2, …, c_k, …, c_K} represents the jth word sequence in the ith sentence sequence, and c_k represents the kth character of the jth word sequence w_i^j of the ith sentence sequence S_i; n represents the total number of sentences in the training data set, m represents the total number of words in one sentence, and K represents the total number of characters in one word;
step 2, labeling the training data set S:
step 2.1, setting the trigger word categories and the named entity categories as L_td = {L_td^1, L_td^2, …} and L_ner = {L_ner^1, L_ner^2, …}; wherein L_td^n denotes the nth trigger word category and L_ner^n denotes the nth entity category;
step 2.2, labels of the trigger word category and the named entity category are simultaneously added to all word sequences of the ith sentence sequence S_i in the training data set S, so as to obtain a labeled training data set S_td of the trigger word detection task and a labeled training data set S_ner of the named entity recognition task; wherein (w_i^j, y_td^{i,j}) in S_td represents the jth word sequence w_i^j of the ith sentence sequence S_i together with its corresponding trigger word category y_td^{i,j}, and (w_i^j, y_ner^{i,j}) in S_ner represents the jth word sequence w_i^j of the ith sentence sequence S_i together with its corresponding entity category y_ner^{i,j};
Step 3, word vector pre-training:
obtaining biomedical documents and performing word segmentation and sentence segmentation to obtain a label-free training data set consisting of n′ sentence sequences, recorded as S′ = {S′_1, S′_2, …, S′_{i′}, …, S′_{n′}}; S′_{i′} represents the i′th sentence sequence; the label-free training data set S′ is then trained with the Word2Vec tool based on a language model to obtain a pre-trained word vector matrix M;
step 4, data processing of the MTL-TD-NER model, a neural network for biomedical trigger word detection and named entity recognition based on multi-task learning; the MTL-TD-NER model consists of a word vector encoding layer based on hybrid-encoding embedding, a feature extraction layer based on bidirectional LSTM, and a classification layer based on conditional random fields;
step 4.1, converting the i-th sentence sequence S_i in text form into word-level word vector data of an arbitrary dimension V using the pre-trained word vector matrix M, recorded as X_i = {x_i^1, x_i^2, ..., x_i^m}; wherein x_i^j represents the word-level word vector of the j-th word w_i^j;
step 4.2, data processing of the word vector encoding layer based on hybrid-encoding embedding, wherein the word vector encoding layer consists of bidirectional long short-term memory network units;
step 4.2.1, inputting each character of the j-th word sequence w_i^j into a character-level bidirectional long short-term memory network unit, which is trained on the probability of all the characters generating the corresponding word;
step 4.2.2, extracting the hidden-layer outputs of the first character and the last character of the j-th word sequence w_i^j from the bidirectional long short-term memory network unit and concatenating them as the character-level vector v_i^j of the j-th word sequence w_i^j;
step 4.3, concatenating the word-level word vector x_i^j and the character-level vector v_i^j of the j-th word w_i^j to obtain the hybrid-encoded word vector e_i^j of the j-th word sequence w_i^j, thereby obtaining the word vector sequence E_i = {e_i^1, e_i^2, ..., e_i^m} of the i-th sentence sequence S_i;
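The hybrid encoding of steps 4.1–4.3 concatenates a V-dimensional word-level vector with the first- and last-character hidden states of a character-level BiLSTM. A dimension-bookkeeping sketch — the dimensions V and d are illustrative, and the BiLSTM is replaced by placeholder hidden-state vectors:

```python
V, d = 4, 3  # word-vector dimension and char-LSTM hidden size (illustrative)

word_vec = [0.1] * V             # x_i^j: word-level vector looked up in M
h_first = [0.2] * d              # hidden state at the first character
h_last = [0.3] * d               # hidden state at the last character

char_vec = h_first + h_last      # v_i^j: character-level vector, length 2d
mixed_vec = word_vec + char_vec  # e_i^j: hybrid-encoded vector, length V + 2d
```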
step 4.4, data processing of the feature extraction layer based on bidirectional LSTM:
inputting the word vector sequence E_i of the i-th sentence sequence S_i into a forward LSTM network structure and into a layer of backward LSTM network structure; finally, the hidden-layer states output by the two LSTM networks for the j-th word vector e_i^j are concatenated together as the context feature information h_i^j of the word vector at the j-th position, thereby obtaining the feature sequence H_i = {h_i^1, h_i^2, ..., h_i^m} of the i-th sentence sequence S_i;
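Step 4.4 runs the sentence through forward and backward recurrent passes and concatenates the two hidden states at each position. A toy sketch in which a simple tanh recurrence stands in for the LSTM cell — the real model uses gated LSTM units; this only illustrates the bidirectional concatenation:

```python
import math

def rnn_pass(xs, reverse=False):
    """One recurrent pass; h_t = tanh(x_t + h_{t-1}) stands in for an LSTM cell."""
    seq = list(reversed(xs)) if reverse else xs
    h, hs = 0.0, []
    for x in seq:
        h = math.tanh(x + h)
        hs.append(h)
    # Re-align the backward pass so hs[j] corresponds to position j.
    return list(reversed(hs)) if reverse else hs

E = [0.5, -0.2, 0.9]                    # scalar stand-ins for e_i^1..e_i^m
fwd = rnn_pass(E)                       # forward hidden states
bwd = rnn_pass(E, reverse=True)         # backward hidden states
H = [(f, b) for f, b in zip(fwd, bwd)]  # h_i^j = concat(forward, backward)
```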
step 4.5, data processing of the classification layer based on conditional random fields:
step 4.5.1, constructing a classification layer for the trigger word detection task and a classification layer for the named entity recognition task, each taking a conditional random field as its basic unit; building a parallel information transformation layer transform_td for the classification layer of the trigger word detection task, and building a parallel information transformation layer transform_ner for the classification layer of the named entity recognition task;
step 4.5.2, inputting the feature sequence H_i into the classification layer of the trigger word detection task to obtain an output O_td; then inputting the output O_td into the information transformation layer transform_td to obtain trigger word feature information F_td; finally, adding the feature information F_td to the feature sequence H_i to obtain overall entity features, which are input into the classification layer of the named entity recognition task to obtain the final entity recognition result;
step 4.5.3, inputting the feature sequence H_i into the classification layer of the named entity recognition task to obtain an output O_ner; then inputting the output O_ner into the information transformation layer transform_ner to obtain entity feature information F_ner; finally, adding the feature information F_ner to the feature sequence H_i to obtain overall trigger word features, which are input into the classification layer of the trigger word detection task to obtain the final trigger word recognition result;
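Each classification layer in step 4.5 is a conditional random field, which decodes the best label sequence with the Viterbi algorithm. A minimal stdlib Viterbi sketch over illustrative emission and transition scores — the numbers are made up; the patent's CRF learns these scores during training:

```python
def viterbi(emissions, transitions, n_tags):
    """Return the highest-scoring tag sequence.

    emissions[t][y]      score of tag y at position t
    transitions[y0][y1]  score of moving from tag y0 to tag y1
    """
    score = list(emissions[0])
    back = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for y in range(n_tags):
            best_prev = max(range(n_tags),
                            key=lambda y0: score[y0] + transitions[y0][y])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][y]
                             + emissions[t][y])
        score = new_score
        back.append(ptr)
    # Follow back-pointers from the best final tag.
    y = max(range(n_tags), key=lambda k: score[k])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return list(reversed(path))

# Tags: 0 = O, 1 = B-Trigger (illustrative). Transitions discourage 1 -> 1.
em = [[2.0, 0.5], [0.1, 1.5], [1.0, 0.2]]
tr = [[0.5, 0.0], [0.0, -1.0]]
path = viterbi(em, tr, n_tags=2)
```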
step 5, training the MTL-TD-NER model to obtain the optimal trigger word detection and named entity recognition model:
step 5.1, setting parameter variables of the model:
the batch size is B, the current iteration number is epoch_now, the maximum number of iterations is epoch_max, the number of consecutive iterations during which the loss of the model does not decrease is epoch_no, and the maximum number of such iterations tolerated by the early-stopping strategy is epoch_es;
Step 5.2, parameter initialization:
initializing each parameter of the word vector layer based on hybrid-encoding embedding, the feature extraction layer based on bidirectional LSTM, and the classification layer based on conditional random fields by sampling from a uniform distribution;
step 5.3, from the epoch_now-th iteration onward, inputting batches of size B from the training data set S into the MTL-TD-NER model and computing the loss between the labels output by the model and the correct labels in the training data set S with formula (1), so as to update the parameters of the model:

loss = λ · loss_td + μ · loss_ner   (1)
in formula (1), loss_td and loss_ner are the loss functions of the trigger word detection and named entity recognition tasks:

loss_td = −score(y_td) + log Σ_{ŷ_td ∈ Y_td} exp(score(ŷ_td))   (2)

loss_ner = −score(y_ner) + log Σ_{ŷ_ner ∈ Y_ner} exp(score(ŷ_ner))   (3)

in formulas (2) and (3), y_td and y_ner are the trigger word label sequence and the named entity label sequence; score(y_td) and score(y_ner) are respectively the scores of the trigger word label sequence and the named entity label sequence output by the MTL-TD-NER model when the i-th sentence sequence S_i is input; λ and μ are hyper-parameters for balancing the importance of the two tasks; Y_td represents the set of all possible trigger word label sequences, Y_ner represents the set of all possible entity label sequences, ŷ_td represents one trigger word label sequence in Y_td, and ŷ_ner represents one entity label sequence in Y_ner;
step 5.4, if epoch_now is less than epoch_max and epoch_no is less than epoch_es, incrementing epoch_now by 1 and returning to step 5.3; if epoch_now is greater than or equal to epoch_max, or epoch_no equals epoch_es, obtaining the optimal trigger word detection and named entity recognition network model based on multi-task learning;
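The stopping logic of steps 5.3–5.4 combines a hard iteration cap with early stopping on a non-decreasing loss. A stdlib sketch of that control flow — the loss schedule is illustrative; a real run would compute the per-batch multi-task loss and update the model's parameters each epoch:

```python
def train(losses_per_epoch, epoch_max, epoch_es):
    """Run epochs until the cap epoch_max or epoch_es epochs without improvement."""
    best_loss = float("inf")
    epoch_no = 0  # consecutive epochs in which the loss did not decrease
    for epoch_now, loss in enumerate(losses_per_epoch, start=1):
        if loss < best_loss:
            best_loss, epoch_no = loss, 0
        else:
            epoch_no += 1
        if epoch_now >= epoch_max or epoch_no == epoch_es:
            return epoch_now  # stop: cap reached or early stopping fired
    return len(losses_per_epoch)

# Illustrative loss schedule: improvement stalls after epoch 3.
stopped = train([2.0, 1.5, 1.2, 1.3, 1.3, 1.25], epoch_max=100, epoch_es=2)
```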
step 6, recognizing unlabeled data with the optimal trigger word detection and named entity recognition network model, so as to obtain the trigger word label and the named entity label of each word in the unlabeled data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110617440.1A CN113360667B (en) | 2021-05-31 | 2021-05-31 | Biomedical trigger word detection and named entity identification method based on multi-task learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113360667A CN113360667A (en) | 2021-09-07 |
CN113360667B true CN113360667B (en) | 2022-07-26 |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||