CN113360667A - Biomedical trigger word detection and named entity identification method based on multitask learning - Google Patents

Biomedical trigger word detection and named entity identification method based on multitask learning

Info

Publication number
CN113360667A
CN113360667A (application CN202110617440.1A)
Authority
CN
China
Prior art keywords
word
sequence
trigger
named entity
ner
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110617440.1A
Other languages
Chinese (zh)
Other versions
CN113360667B (en)
Inventor
苏延森
詹飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202110617440.1A priority Critical patent/CN113360667B/en
Publication of CN113360667A publication Critical patent/CN113360667A/en
Application granted granted Critical
Publication of CN113360667B publication Critical patent/CN113360667B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks

Abstract

The invention discloses a biomedical trigger word detection and named entity recognition method based on multi-task learning, which comprises the following steps: 1, preprocessing unstructured biomedical text with word-segmentation and sentence-segmentation techniques, and labeling the preprocessed text to generate a standard data set; 2, constructing a multi-task-learning neural network model for biomedical trigger word detection and named entity recognition; 3, training the neural network model and updating its parameters; and 4, predicting on unlabeled data with the trained optimal model so as to identify the trigger words and named entities it contains. The method performs trigger word detection and named entity recognition on biomedical text simultaneously, thereby effectively improving recognition accuracy and reducing the demand on computing resources.

Description

Biomedical trigger word detection and named entity identification method based on multitask learning
Technical Field
The invention relates to the field of biomedical text mining, in particular to a biomedical trigger word detection and named entity identification method based on multi-task learning.
Background
A named entity is a noun or noun phrase in text that carries particularly critical meaning. Named entity recognition can be divided into general-domain and domain-specific entity recognition. In the general domain, entities can be divided into organization names, person names, place names, and the like. In the biomedical domain, entities can be classified into cell entities, gene entities, protein entities, drug entities, disease entities, and the like. Compared with general-domain named entity recognition, named entity recognition in the biomedical domain is more difficult because of entity nesting, word ambiguity, and the like. Accurate identification of biomedical entities can facilitate the further development of information extraction and natural language processing techniques. In the biomedical field, named entity recognition can extract structured biomedical entity information from large volumes of unstructured documents, which promotes the construction of biomedical knowledge graphs and databases.
Popular named entity recognition methods fall mainly into rule-based methods, traditional machine-learning methods, and deep-learning methods. Rule-based approaches rely primarily on manually formulated rules, including domain-specific gazetteers and syntactic-lexical patterns, to identify entities in text, and they need no annotated data set. Traditional machine-learning approaches rely mainly on manually designed linguistic features, such as prefix and suffix features, lexical features, and syntactic features, to train a traditional machine-learning algorithm to recognize named entities. In recent years, owing to the advantage of deep neural networks in automatically extracting the internal features of data, many deep-learning-based named entity recognition methods have emerged. However, current methods can mostly perform only a single, independent entity recognition task and do not extract the semantic feature information in the text sufficiently, so their recognition performance is limited.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a biomedical trigger word detection and named entity recognition method based on multi-task learning, so that trigger word detection and named entity recognition can be performed on biomedical text simultaneously, effectively improving recognition accuracy and reducing the demand on computing resources.
In order to achieve the purpose, the invention adopts the technical scheme that:
the invention relates to a biomedical trigger word detection and named entity identification method based on multitask learning, which is characterized by comprising the following steps of:
step 1, preprocessing unstructured biomedical texts:
performing word segmentation and sentence segmentation on the unstructured biomedical text to obtain an unlabeled training data set consisting of n sentence sequences, denoted S = {S_1, S_2, ..., S_i, ..., S_n}; S_i denotes the i-th sentence sequence, and S_i = {w_i^1, w_i^2, ..., w_i^j, ..., w_i^m}, where w_i^j denotes the j-th word sequence in the i-th sentence sequence and w_i^j = {c_1, c_2, ..., c_k, ..., c_K}, c_k denoting the k-th character of the j-th word sequence w_i^j of the i-th sentence sequence S_i; n denotes the total number of sentences in the training data set, m the total number of words in a sentence, and K the total number of characters in a word;
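As an illustrative sketch only (the patent does not specify its segmentation tooling), the word and sentence segmentation of step 1 can be approximated with regular expressions; real biomedical pipelines would typically use a dedicated tokenizer:

```python
import re

def preprocess(text):
    """Split raw biomedical text into sentence sequences of word sequences.

    A minimal regex-based sketch: sentences are split on terminal punctuation
    followed by whitespace; words keep internal hyphens (common in gene and
    protein names) and punctuation becomes its own token.
    """
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [re.findall(r"[A-Za-z0-9-]+|[^\sA-Za-z0-9]", s)
            for s in sentences if s]

S = preprocess("IL-2 activates NF-kappaB. TNF induces apoptosis!")
# S is the unlabeled training set: each S[i] is a word sequence,
# and each word is itself a character sequence.
```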
step 2, labeling the training data set S:
step 2.1, let the categories of trigger words and of named entities be L^td = {L_td^1, ..., L_td^n, ...} and L^ner = {L_ner^1, ..., L_ner^n, ...} respectively, where L_td^n denotes the n-th trigger word category and L_ner^n denotes the n-th entity category;
step 2.2, add both a trigger-word-category label and a named-entity-category label to every word sequence of the i-th sentence sequence S_i in the training data set S, thereby obtaining a labeled training data set S_td = {(w_i^j, y_td^{i,j})} for the trigger word detection task and a labeled training data set S_ner = {(w_i^j, y_ner^{i,j})} for the named entity recognition task, where (w_i^j, y_td^{i,j}) pairs the j-th word sequence w_i^j of the i-th sentence sequence S_i with its corresponding trigger word category, and (w_i^j, y_ner^{i,j}) pairs the j-th word sequence w_i^j of S_i with its corresponding entity category;
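The dual labeling of step 2.2 can be sketched as two parallel BIO tag sequences over the same sentence; all category names below (Protein, Gene_expression, Positive_regulation) are illustrative assumptions, not taken from the patent:

```python
# Hypothetical example: one sentence labeled for both tasks in parallel.
sentence = ["IL-2", "expression", "activates", "NF-kappaB"]

# Trigger word labels (BIO scheme over illustrative trigger categories).
y_td  = ["O", "B-Gene_expression", "B-Positive_regulation", "O"]
# Named entity labels (BIO scheme over illustrative entity categories).
y_ner = ["B-Protein", "O", "O", "B-Protein"]

# The two labeled data sets pair each word sequence with one label per task.
D_td  = list(zip(sentence, y_td))
D_ner = list(zip(sentence, y_ner))
```

The same word thus carries two labels at once, which is what lets one shared encoder serve both classification layers.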
Step 3, word vector pre-training:
obtaining biomedical documents and performing word segmentation and sentence segmentation to obtain an unlabeled training data set consisting of n' sentence sequences, denoted S' = {S'_1, S'_2, ..., S'_{i'}, ..., S'_{n'}}, where S'_{i'} denotes the i'-th sentence sequence; the unlabeled training data set S' is then trained with the language-model-based Word2Vec tool to obtain a pre-trained word vector matrix M;
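A sketch of how the pre-trained matrix M is consumed downstream; here M is a random stand-in (actually producing it would use a Word2Vec implementation such as gensim), and the vocabulary and dimension V = 50 are assumed for illustration:

```python
import numpy as np

V = 50                                # word vector dimension (arbitrary)
vocab = {"<unk>": 0, "IL-2": 1, "activates": 2, "NF-kappaB": 3}
rng = np.random.default_rng(0)
M = rng.normal(size=(len(vocab), V))  # stand-in for the pre-trained matrix

def lookup(words):
    """Map a word sequence to its rows of M (word-level word vectors)."""
    idx = [vocab.get(w, vocab["<unk>"]) for w in words]
    return M[idx]

X = lookup(["IL-2", "activates", "NF-kappaB"])  # shape (3, V)
```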
step 4, data processing of the MTL-TD-NER model, a multi-task-learning neural network for biomedical trigger word detection and named entity recognition; the MTL-TD-NER model consists of a word vector encoding layer based on hybrid embedding, a feature extraction layer based on a bidirectional LSTM, and classification layers based on conditional random fields;
step 4.1, use the pre-trained word vector matrix M to convert the i-th sentence sequence S_i in text form into word-level word vectors of an arbitrary dimension V, denoted X_i = {x_i^1, x_i^2, ..., x_i^j, ..., x_i^m}, where x_i^j denotes the word-level word vector of the j-th word w_i^j;
step 4.2, process the data with the word vector encoding layer based on hybrid embedding, which is composed of bidirectional long short-term memory (LSTM) network units:
step 4.2.1, input each character of the j-th word sequence w_i^j into a character-level bidirectional LSTM unit, which is trained on the probability of the characters generating the corresponding word;
step 4.2.2, extract the hidden-layer outputs of the bidirectional LSTM unit for the first character and the last character of the j-th word sequence w_i^j and concatenate them as the character-level vector r_i^j of the j-th word sequence w_i^j;
Step 4.3, for the jth word
Figure BDA0003092803070000035
Word-level word vector of
Figure BDA0003092803070000036
And character level word vectors
Figure BDA0003092803070000037
Splicing to obtain the jth word sequence
Figure BDA0003092803070000038
Mixed coded word vector of
Figure BDA0003092803070000039
Thereby obtaining the ith sentence sequence SiWord vector of
Figure BDA00030928030700000310
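Steps 4.2 and 4.3 can be sketched with a toy character encoder: a plain tanh recurrence stands in for the character-level bidirectional LSTM (no gates), its two directional final states are concatenated, and the result is joined to the word-level vector; all dimensions and weights are illustrative:

```python
import numpy as np

V, C = 50, 16            # word vector dim and char hidden dim (arbitrary)
rng = np.random.default_rng(1)
W_c = rng.normal(scale=0.1, size=(C, C))
ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-"
E_c = {ch: rng.normal(scale=0.1, size=C) for ch in ALPHABET}

def char_vector(word):
    """Toy stand-in for step 4.2: a tanh recurrence run forward and backward
    over the characters; the final state of each direction is concatenated."""
    def run(chars):
        h = np.zeros(C)
        for ch in chars:
            h = np.tanh(W_c @ h + E_c.get(ch, np.zeros(C)))
        return h
    return np.concatenate([run(word), run(reversed(word))])  # shape (2C,)

def hybrid_vector(word, word_vec):
    """Step 4.3: concatenate word-level and character-level vectors."""
    return np.concatenate([word_vec, char_vector(word)])

e = hybrid_vector("IL-2", np.zeros(V))  # shape (V + 2C,)
```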
And 4.4, processing data of the feature extraction layer based on the bidirectional LSTM:
the ith sentence sequence SiWord vector of
Figure BDA00030928030700000311
Inputting the sentence into a layer of forward LSTM network structure, and then inputting the ith sentence sequence SiWord vector of
Figure BDA00030928030700000312
Inputting into a layer of reverse LSTM network structure, and finally, inputting the jth word vector
Figure BDA00030928030700000313
Implicit in two LSTM networksLayer state output
Figure BDA00030928030700000314
And
Figure BDA00030928030700000315
the combinations are spliced together as a word vector at the j' th position
Figure BDA00030928030700000316
Context feature information of
Figure BDA00030928030700000317
Thereby obtaining the ith sentence sequence SiCharacteristic sequence of
Figure BDA00030928030700000318
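The feature extraction of step 4.4 can be sketched the same way, with a plain tanh recurrence standing in for the LSTM gates; the point is the per-position concatenation of forward and backward hidden states:

```python
import numpy as np

D_in, D_h = 20, 8                    # input and hidden dims (arbitrary)
rng = np.random.default_rng(2)
W_x = rng.normal(scale=0.1, size=(D_h, D_in))
W_h = rng.normal(scale=0.1, size=(D_h, D_h))

def directional_pass(E):
    """One recurrent pass over the sentence; returns a hidden state per word."""
    h, states = np.zeros(D_h), []
    for e in E:
        h = np.tanh(W_x @ e + W_h @ h)
        states.append(h)
    return np.stack(states)

def bidirectional_features(E):
    """Step 4.4: concatenate forward and backward hidden states per position."""
    fwd = directional_pass(E)
    bwd = directional_pass(E[::-1])[::-1]   # run reversed, then re-align
    return np.concatenate([fwd, bwd], axis=1)

E_i = rng.normal(size=(5, D_in))            # 5 hybrid word vectors
H_i = bidirectional_features(E_i)           # shape (5, 2 * D_h)
```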
And 4.5, processing data of a classification layer based on the conditional random field:
step 4.5.1, constructing a classification layer of a trigger word detection task and a classification layer of a named entity recognition task, wherein the classification layer takes a conditional random field as a basic unit; building parallel information conversion layer transform for classification layer of trigger word detection tasktd(ii) a Building parallel information conversion layer transform for classification layer of named entity recognition taskner
Step 4.5.2, feature sequence
Figure BDA00030928030700000319
Inputting the input into a classification layer of a trigger word detection task to obtain an output OtdThen output OtdInput to information transformationtdObtain the characteristic information F of the trigger wordtdFinally, the feature information F is obtainedtdAnd the characteristic sequence HiAdding the entity integral characteristics to obtain an entity integral characteristic, inputting the entity integral characteristic into a classification layer of a named entity identification task to obtain a final entity identification result
Figure BDA00030928030700000320
Step 4.5.3, mixingCharacteristic sequence
Figure BDA00030928030700000321
Inputting the input into a classification layer of a named entity recognition task to obtain an output OnerThen output OnerInput to information transformationnerTo obtain the characteristic information F of the entitynerFinally, the feature information F is obtainednerAnd the characteristic sequence HiAdding the overall characteristics of the trigger words to obtain the overall characteristics of the trigger words, inputting the overall characteristics into a classification layer of a trigger word detection task, and obtaining a final trigger word recognition result
Figure BDA0003092803070000041
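The cross-task wiring of steps 4.5.2 and 4.5.3 can be sketched with plain linear maps standing in for the CRF classification layers and the transform layers (the patent's classifiers are conditional random fields; everything below is an illustrative assumption about shapes and data flow only):

```python
import numpy as np

m, D = 5, 16                       # sentence length, feature dimension
rng = np.random.default_rng(3)
H_i = rng.normal(size=(m, D))      # feature sequence from the shared BiLSTM

def linear(d_out, d_in):
    return rng.normal(scale=0.1, size=(d_out, d_in))

W_td, W_ner = linear(D, D), linear(D, D)   # stand-ins for the two classifiers
T_td, T_ner = linear(D, D), linear(D, D)   # transform_td / transform_ner

# Step 4.5.2: trigger path feeds the entity classifier.
O_td = H_i @ W_td.T                # classifier output on the trigger side
F_td = O_td @ T_td.T               # transform_td: trigger feature information
entity_input = F_td + H_i          # overall entity features for NER

# Step 4.5.3: entity path feeds the trigger classifier, symmetrically.
O_ner = H_i @ W_ner.T
F_ner = O_ner @ T_ner.T
trigger_input = F_ner + H_i        # overall trigger features for detection
```

Note the symmetry: each task's intermediate output is transformed and added back to the shared features consumed by the other task.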
Step 5, training an MTL-TD-NER model to obtain an optimal trigger word detection and named entity recognition model:
step 5.1, set the parameter variables of the model:
the batch size is B; the current iteration number is epoch_now; the maximum number of iterations is epoch_max; the number of consecutive iterations in which the loss of the model has not decreased is epoch_no; and the maximum number of such iterations tolerated by the early-stopping strategy is epoch_es;
Step 5.2, parameter initialization:
initializing each parameter of a word vector layer based on mixed coding embedding, a feature extraction layer based on bidirectional LSTM and a classification layer based on conditional random field by adopting a uniform distribution method;
step 5.3, starting from epoch_now, input batches of size B from the training data set S into the MTL-TD-NER model each time, and compute with Eq. (1) the loss between the output labels of the model and the correct labels in the training data set S, so as to update the parameters of the model:

loss = λ·loss_td + μ·loss_ner    (1)

In Eq. (1), loss_td and loss_ner are the loss functions of the trigger word detection and named entity recognition tasks, with:

loss_td = −score(y_td) + log Σ_{ỹ_td ∈ Y_td} exp(score(ỹ_td))    (2)

loss_ner = −score(y_ner) + log Σ_{ỹ_ner ∈ Y_ner} exp(score(ỹ_ner))    (3)

In Eqs. (2) and (3), y_td and y_ner are the trigger word and named entity label sequences; score(y_td) and score(y_ner) are, respectively, the scores of the trigger word and named entity label sequences output by the MTL-TD-NER model for the input i-th sentence sequence S_i; λ and μ are hyperparameters used to balance the importance of the two tasks; Y_td denotes the set of all possible trigger word label sequences and Y_ner the set of all possible entity label sequences; ỹ_td denotes one trigger word label sequence in Y_td, and ỹ_ner denotes one entity label sequence in Y_ner;
step 5.4, if epoch_now is less than epoch_max and epoch_no is less than epoch_es, add 1 to epoch_now and then continue with step 5.3; if epoch_now is greater than or equal to epoch_max, or epoch_no equals epoch_es, the optimal multi-task-learning trigger word detection and named entity recognition network model is obtained;
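The control flow of steps 5.3 and 5.4 can be sketched as follows; train_one_epoch is a hypothetical stand-in that runs one pass over the batches and returns the epoch loss:

```python
def train(train_one_epoch, epoch_max=100, epoch_es=15):
    """Run training with the early-stopping rule of step 5.4: stop when
    epoch_max is reached or the loss has not decreased for epoch_es
    consecutive epochs."""
    best_loss = float("inf")
    epoch_now, epoch_no = 0, 0
    while epoch_now < epoch_max and epoch_no < epoch_es:
        loss = train_one_epoch(epoch_now)
        if loss < best_loss:
            best_loss, epoch_no = loss, 0   # improvement: reset the counter
        else:
            epoch_no += 1                   # no improvement this epoch
        epoch_now += 1
    return epoch_now, best_loss

# Simulated losses: improve for 10 epochs, then plateau.
losses = [1.0 / (e + 1) for e in range(10)] + [0.1] * 200
stop_epoch, best = train(lambda e: losses[e])
```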
and 6, identifying the unlabeled data by using the optimal trigger word detection and named entity identification network model so as to obtain a trigger word label and a named entity identification label of each word in the unlabeled data.
Compared with the prior art, the invention has the beneficial effects that:
1. the method is different from the traditional named entity identification method based on rules and machine learning, realizes an end-to-end neural network model, avoids the manual design of various rules such as lexical and syntactic rules and the manual extraction of linguistic features, and simplifies the implementation of trigger word detection and named entity identification.
2. The invention designs a neural network model to simultaneously process the trigger word detection task and the named entity recognition task, and adopts a hard parameter sharing mode to enable the two tasks to share the same word vector coding layer based on mixed coding embedding and the feature extraction layer based on bidirectional LSTM, thereby accelerating the training process of the model and improving the operation efficiency of the model.
3. The invention uses information conversion layers to exchange the mutually beneficial information between trigger words and named entities, so that useful feature information can be mined better; this information is fed into the respective classification layers so that the two tasks help each other to recognize trigger words and named entities better.
4. According to the method, the trigger detection task and the named entity recognition task are trained simultaneously under the multi-task learning framework, so that data enhancement can be performed implicitly, regularization is introduced, the risk of over-fitting is effectively avoided, and the recognition accuracy is improved.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
In this embodiment, a biomedical trigger word detection and named entity recognition method based on multi-task learning mainly uses a word vector encoding layer based on hybrid embedding and a feature extraction layer based on a bidirectional LSTM as the shared part of the two tasks, and then constructs two classification layers based on conditional random fields so as to perform trigger word detection and named entity recognition simultaneously. As shown in FIG. 1, the method proceeds according to the following steps:
step 1, preprocessing unstructured biomedical texts:
performing word segmentation and sentence segmentation on the unstructured biomedical text to obtain an unlabeled training data set consisting of n sentence sequences, denoted S = {S_1, S_2, ..., S_i, ..., S_n}; S_i denotes the i-th sentence sequence, and S_i = {w_i^1, w_i^2, ..., w_i^j, ..., w_i^m}, where w_i^j denotes the j-th word sequence in the i-th sentence sequence and w_i^j = {c_1, c_2, ..., c_k, ..., c_K}, c_k denoting the k-th character of the j-th word sequence w_i^j of the i-th sentence sequence S_i; n denotes the total number of sentences in the training data set, m the total number of words in a sentence, and K the total number of characters in a word;
step 2, labeling the training data set S:
step 2.1, let the categories of trigger words and of named entities be L^td = {L_td^1, ..., L_td^n, ...} and L^ner = {L_ner^1, ..., L_ner^n, ...} respectively, where L_td^n denotes the n-th trigger word category and L_ner^n denotes the n-th entity category;
step 2.2, add both a trigger-word-category label and a named-entity-category label to every word sequence of the i-th sentence sequence S_i in the training data set S, thereby obtaining a labeled training data set S_td = {(w_i^j, y_td^{i,j})} for the trigger word detection task and a labeled training data set S_ner = {(w_i^j, y_ner^{i,j})} for the named entity recognition task, where (w_i^j, y_td^{i,j}) pairs the j-th word sequence w_i^j of the i-th sentence sequence S_i with its corresponding trigger word category, and (w_i^j, y_ner^{i,j}) pairs the j-th word sequence w_i^j of S_i with its corresponding entity category;
Step 3, word vector pre-training:
in order to allow the word-level word vectors to contain a large amount of linguistic information, a large number of biomedical documents are downloaded from the PubMed database and subjected to word segmentation and sentence segmentation, giving an unlabeled training data set consisting of n' sentence sequences, denoted S' = {S'_1, S'_2, ..., S'_{i'}, ..., S'_{n'}}, where S'_{i'} denotes the i'-th sentence sequence; S' is then trained with the language-model-based Word2Vec tool to obtain a pre-trained word vector matrix M;
step 4, data processing of the MTL-TD-NER model, a multi-task-learning neural network for biomedical trigger word detection and named entity recognition; the MTL-TD-NER model consists of a word vector encoding layer based on hybrid embedding, a feature extraction layer based on a bidirectional LSTM, and classification layers based on conditional random fields;
step 4.1, use the pre-trained word vector matrix M to convert the i-th sentence sequence S_i in text form into word-level word vectors of an arbitrary dimension V, denoted X_i = {x_i^1, x_i^2, ..., x_i^j, ..., x_i^m}, where x_i^j denotes the word-level word vector of the j-th word w_i^j.
Step 4.2, processing data based on a word vector coding layer embedded by mixed coding, wherein the word vector coding layer is composed of bidirectional long-term and short-term memory network units;
step 4.2.1, in order to obtain the character level characteristic information of the word, the jth word sequence is processed
Figure BDA00030928030700000615
Each character in the character string is input into a bidirectional long and short term memory network unit at the character level and used for training the generation probability of generating corresponding words by all the characters;
step 4.2.2, extracting the jth word sequence
Figure BDA0003092803070000071
First word inThe character and the last character are used as the j word sequence after the output connection on the hidden layer in the bidirectional long and short term memory network unit
Figure BDA0003092803070000072
Character level vector of
Figure BDA0003092803070000073
Step 4.3, for the jth word
Figure BDA0003092803070000074
Word-level word vector of
Figure BDA0003092803070000075
And character level word vectors
Figure BDA0003092803070000076
Splicing to obtain the jth word sequence
Figure BDA0003092803070000077
Mixed coded word vector of
Figure BDA0003092803070000078
Thereby obtaining the ith sentence sequence SiWord vector of
Figure BDA0003092803070000079
And 4.4, processing data of the feature extraction layer based on the bidirectional LSTM:
in order to obtain the characteristic information of the whole context, the ith sentence sequence SiWord vector of
Figure BDA00030928030700000710
Inputting the sentence into a layer of forward LSTM network structure, and then inputting the ith sentence sequence SiWord vector of
Figure BDA00030928030700000711
Input to a layer of inverted LSTMIn the network structure, the jth word vector is finally added
Figure BDA00030928030700000712
Implicit layer state output in two LSTM networks
Figure BDA00030928030700000713
And
Figure BDA00030928030700000714
the combinations are spliced together as a word vector at the j' th position
Figure BDA00030928030700000715
Context feature information of
Figure BDA00030928030700000716
Thereby obtaining the ith sentence sequence SiCharacteristic sequence of
Figure BDA00030928030700000717
And 4.5, processing data of a classification layer based on the conditional random field:
4.5.1, constructing a classification layer of a trigger word detection task and a classification layer of a named entity recognition task, wherein the classification layer takes a conditional random field as a basic unit, so that the classification layer can well process the problem of label dependence; meanwhile, considering the relation that the trigger words and the entities can be correlated and promoted, parallel information conversion layer transform is constructed for the classification layer of the trigger word detection tasktd(ii) a Building parallel information conversion layer transform for classification layer of named entity recognition taskner
Step 4.5.2, feature sequence
Figure BDA00030928030700000718
Inputting the input into a classification layer of a trigger word detection task to obtain an output OtdThen output OtdInput to information transformationtdObtain the characteristic information F of the trigger wordtdFinally, the feature information F is obtainedtdAnd the characteristic sequence HiAdding the entity integral characteristics to obtain an entity integral characteristic, inputting the entity integral characteristic into a classification layer of a named entity identification task to obtain a final entity identification result
Figure BDA00030928030700000719
Step 4.5.3, feature sequence
Figure BDA00030928030700000720
Inputting the input into a classification layer of a named entity recognition task to obtain an output OnerThen output OnerInput to information transformationnerTo obtain the characteristic information F of the entitynerFinally, the feature information F is obtainednerAnd the characteristic sequence HiAdding the overall characteristics of the trigger words to obtain the overall characteristics of the trigger words, inputting the overall characteristics into a classification layer of a trigger word detection task, and obtaining a final trigger word recognition result
Figure BDA00030928030700000721
Step 5, training an MTL-TD-NER model to obtain an optimal trigger word detection and named entity recognition model:
step 5.1, set the parameter variables of the model:
the batch size B is set to 50; the current iteration number epoch_now starts at 0; the maximum number of iterations epoch_max is 100; the number of consecutive iterations in which the loss of the model has not decreased, epoch_no, starts at 0; and the maximum number of such iterations tolerated by the early-stopping strategy, epoch_es, is 15;
step 5.2, parameter initialization:
initializing each parameter of a word vector layer based on mixed coding embedding, a feature extraction layer based on bidirectional LSTM and a classification layer based on conditional random field by adopting a uniform distribution method;
step 5.3, starting from epoch_now, input batches of size B from the training data set S into the MTL-TD-NER model each time, and compute with Eq. (1) the loss between the output labels of the model and the correct labels in the training data set S, so as to update the parameters of the model:

loss = λ·loss_td + μ·loss_ner    (1)

In Eq. (1), loss_td and loss_ner are the loss functions of the trigger word detection and named entity recognition tasks, with:

loss_td = −score(y_td) + log Σ_{ỹ_td ∈ Y_td} exp(score(ỹ_td))    (2)

loss_ner = −score(y_ner) + log Σ_{ỹ_ner ∈ Y_ner} exp(score(ỹ_ner))    (3)

In Eqs. (2) and (3), y_td and y_ner are the trigger word and named entity label sequences; score(y_td) and score(y_ner) are, respectively, the scores of the trigger word and named entity label sequences output by the MTL-TD-NER model for the input i-th sentence sequence S_i; λ and μ are hyperparameters used to balance the importance of the two tasks, and both are set to 1 here; Y_td denotes the set of all possible trigger word label sequences and Y_ner the set of all possible entity label sequences; ỹ_td denotes one trigger word label sequence in Y_td, and ỹ_ner denotes one entity label sequence in Y_ner;
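Assuming Eqs. (2) and (3) take the standard CRF negative-log-likelihood form (minus the gold-sequence score plus a log-sum-exp normalizer over candidate label sequences), the joint loss of Eq. (1) can be checked numerically; here the exponentially many candidate sequences are replaced by a small explicit list of toy scores:

```python
import math

def crf_nll(gold_score, all_scores):
    """-score(y) + log Σ exp(score(ỹ)) over candidate label sequences,
    computed with the log-sum-exp trick for numerical stability."""
    m = max(all_scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in all_scores))
    return -gold_score + log_z

# Toy scores for a handful of candidate tag sequences per task;
# the gold sequence's score is included in each candidate list.
loss_td  = crf_nll(gold_score=3.0, all_scores=[3.0, 1.0, 0.5])
loss_ner = crf_nll(gold_score=2.0, all_scores=[2.0, 2.0, 0.0])

lam = mu = 1.0                         # both hyperparameters set to 1
loss = lam * loss_td + mu * loss_ner   # Eq. (1)
```

Because the gold score is one term inside the normalizer, each task loss is strictly positive and shrinks as the model assigns the gold sequence a larger share of the total score.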
step 5.4, if epoch_now is less than epoch_max and epoch_no is less than epoch_es, add 1 to epoch_now and then continue with step 5.3; if epoch_now is greater than or equal to epoch_max, or epoch_no equals epoch_es, the optimal multi-task-learning trigger word detection and named entity recognition network model is obtained;
and 6, identifying the unlabeled data by using the optimal trigger word detection and named entity identification network model so as to obtain a trigger word label and a named entity identification label of each word in the unlabeled data.
The above describes a biomedical trigger word detection and named entity recognition method based on multi-task learning. The method avoids using two different models to perform trigger word detection and named entity recognition separately, and instead designs a multi-task learning framework that performs the two tasks simultaneously. Experiments with the MTL-TD-NER model on a data set verify the effectiveness of the proposed multi-task learning framework and show that it has certain advantages in both trigger word detection and named entity recognition.

Claims (1)

1. A biomedical trigger word detection and named entity recognition method based on multi-task learning, characterized by comprising the following steps:
step 1, preprocessing the unstructured biomedical text:
performing word segmentation and sentence segmentation on the unstructured biomedical text to obtain an unlabeled training data set consisting of n sentence sequences, denoted S = {S_1, S_2, ..., S_i, ..., S_n}, where S_i denotes the i-th sentence sequence, S_i = {w_i^1, w_i^2, ..., w_i^j, ..., w_i^m}; w_i^j denotes the j-th word sequence in the i-th sentence sequence, w_i^j = {c_j^1, c_j^2, ..., c_j^k, ..., c_j^K}; c_j^k denotes the k-th character of the j-th word sequence w_i^j of the i-th sentence sequence S_i; n denotes the total number of sentences in the training data set, m denotes the total number of words in a sentence, and K denotes the total number of characters in a word;
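A minimal sketch of the step-1 preprocessing, using a simple regex tokenizer as a stand-in for a full biomedical tokenizer (the splitting rules below are illustrative):

```python
import re

def preprocess(text):
    """Split raw biomedical text into sentences, words, and characters,
    yielding the nested structure of step 1: a list of sentences, each a
    list of words, each a list of characters."""
    # Sentence segmentation: split after sentence-final punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    S = []
    for sent in sentences:
        # Word segmentation: alphanumeric runs or single punctuation marks.
        words = re.findall(r"\w+|[^\w\s]", sent)
        S.append([list(w) for w in words])   # each word as its characters
    return S

S = preprocess("IL-2 activates T cells. STAT3 binds DNA.")
```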
step 2, labeling the training data set S:
step 2.1, setting the trigger word categories and the named entity categories as L_td = {L_td^1, L_td^2, ...} and L_ner = {L_ner^1, L_ner^2, ...}, where L_td^n denotes the n-th trigger word category and L_ner^n denotes the n-th entity category;
step 2.2, simultaneously adding trigger-word-category labels and named-entity-category labels to all word sequences of the i-th sentence sequence S_i in the training data set S, thereby obtaining a labeled training data set D_td = {(w_i^j, y_td^j)} for the trigger word detection task and a labeled training data set D_ner = {(w_i^j, y_ner^j)} for the named entity recognition task, where (w_i^j, y_td^j) denotes the j-th word sequence w_i^j of the i-th sentence sequence S_i together with its corresponding trigger word category y_td^j, and (w_i^j, y_ner^j) denotes the j-th word sequence w_i^j together with its corresponding entity category y_ner^j;
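The dual labeling of step 2.2 can be illustrated with one toy sentence. The BIO-style tag names below are hypothetical examples, not tag sets defined by the patent, which only requires one trigger label and one entity label per word:

```python
# One sentence labeled twice: once with trigger-word tags (for D_td)
# and once with named-entity tags (for D_ner).
words = ["IL-2", "expression", "activates", "T", "cells"]
trigger_tags = ["O", "B-Gene_expression", "B-Positive_regulation", "O", "O"]
entity_tags  = ["B-Protein", "O", "O", "B-Cell", "I-Cell"]

D_td  = list(zip(words, trigger_tags))   # trigger-detection training pairs
D_ner = list(zip(words, entity_tags))    # named-entity training pairs
```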
step 3, word vector pre-training:
obtaining biomedical documents and performing word segmentation and sentence segmentation to obtain an unlabeled corpus consisting of n′ sentence sequences, denoted S′ = {S′_1, S′_2, ..., S′_i′, ..., S′_n′}, where S′_i′ denotes the i′-th sentence sequence; training on the unlabeled corpus S′ with the language-model-based Word2Vec tool to obtain a pre-trained word vector matrix M;
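Assuming the pre-trained matrix M maps vocabulary indices to rows of dimension V, the lookup from word sequences to word vectors might look like this (the vocabulary, dimensions, and `<UNK>` fallback are illustrative; a random matrix stands in for the Word2Vec output):

```python
import numpy as np

rng = np.random.default_rng(0)
V = 50                                   # word-vector dimensionality
vocab = ["<UNK>", "IL", "2", "activates", "T", "cells"]
word2id = {w: i for i, w in enumerate(vocab)}

# Stand-in for the Word2Vec-trained matrix M: one row per vocabulary word.
M = rng.standard_normal((len(vocab), V))

def lookup(words):
    """Map a word sequence to rows of M, falling back to <UNK>."""
    ids = [word2id.get(w, word2id["<UNK>"]) for w in words]
    return M[ids]

X = lookup(["IL", "2", "binds"])         # "binds" is out of vocabulary
```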
step 4, data processing of the MTL-TD-NER model, a neural network for biomedical trigger word detection and named entity recognition based on multi-task learning; the MTL-TD-NER model consists of a word vector encoding layer based on hybrid-coding embedding, a feature extraction layer based on bidirectional LSTM, and classification layers based on conditional random fields;
step 4.1, using the pre-trained word vector matrix M, converting the i-th sentence sequence S_i from text form into word-level word vectors of dimension V, X_i = {x_1^w, x_2^w, ..., x_j^w, ..., x_m^w}, where x_j^w denotes the word-level word vector of the j-th word w_i^j;
step 4.2, data processing of the word vector encoding layer based on hybrid-coding embedding, the encoding layer consisting of character-level bidirectional long short-term memory (LSTM) network units:
step 4.2.1, inputting each character of the j-th word sequence w_i^j into a character-level bidirectional LSTM unit, which is trained on the probability of the characters generating the corresponding word;
step 4.2.2, taking the hidden-layer outputs of the bidirectional LSTM unit at the last character (forward direction) and the first character (backward direction) and concatenating them as the character-level vector x_j^c of the j-th word sequence w_i^j;
step 4.3, concatenating the word-level word vector x_j^w and the character-level vector x_j^c of the j-th word w_i^j to obtain the hybrid-coded word vector x_j = [x_j^w; x_j^c] of the j-th word sequence w_i^j, thereby obtaining the word vector sequence X_i = {x_1, x_2, ..., x_m} of the i-th sentence sequence S_i;
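The hybrid encoding of steps 4.2 and 4.3 can be sketched at the shape level. A toy recurrence stands in for the character-level bidirectional LSTM, and all sizes are illustrative; a real implementation would use LSTM cells:

```python
import numpy as np

rng = np.random.default_rng(1)
Hc, V = 8, 12                  # char hidden size, word-vector size (illustrative)
char_emb = {c: rng.standard_normal(Hc) for c in "abcdefghijklmnopqrstuvwxyz"}

def toy_rnn(embs):
    """Toy recurrence standing in for one LSTM direction."""
    h = np.zeros(Hc)
    for e in embs:
        h = np.tanh(0.5 * h + e)
    return h

def char_level_vector(word):
    """Step 4.2 sketch: run both directions over the characters and
    concatenate the final forward state (at the last character) with the
    final backward state (at the first character)."""
    embs = [char_emb[c] for c in word.lower() if c in char_emb]
    fwd = toy_rnn(embs)            # forward direction
    bwd = toy_rnn(embs[::-1])      # backward direction
    return np.concatenate([fwd, bwd])   # 2*Hc char-level vector

# Step 4.3 sketch: splice the word-level vector with the char-level vector.
word_vec = rng.standard_normal(V)      # word-level vector looked up from M
hybrid = np.concatenate([word_vec, char_level_vector("cells")])
```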
step 4.4, data processing of the feature extraction layer based on bidirectional LSTM:
inputting the word vector sequence X_i of the i-th sentence sequence S_i into a forward LSTM layer and into a backward LSTM layer; for the j-th word vector x_j, the hidden-layer states output by the two LSTM directions, h_j→ and h_j←, are concatenated as the context feature h_j = [h_j→; h_j←] of the word at the j-th position, thereby obtaining the feature sequence H_i = {h_1, h_2, ..., h_m} of the i-th sentence sequence S_i;
step 4.5, data processing of the classification layers based on conditional random fields:
step 4.5.1, constructing a classification layer for the trigger word detection task and a classification layer for the named entity recognition task, each taking a conditional random field as its basic unit; building a parallel information conversion layer transform_td for the classification layer of the trigger word detection task and a parallel information conversion layer transform_ner for the classification layer of the named entity recognition task;
step 4.5.2, inputting the feature sequence H_i into the classification layer of the trigger word detection task to obtain an output O_td; inputting O_td into transform_td to obtain trigger word feature information F_td; finally, adding the feature information F_td to the feature sequence H_i to obtain entity-oriented overall features, and inputting them into the classification layer of the named entity recognition task to obtain the final entity recognition result Y_ner;
step 4.5.3, inputting the feature sequence H_i into the classification layer of the named entity recognition task to obtain an output O_ner; inputting O_ner into transform_ner to obtain entity feature information F_ner; finally, adding the feature information F_ner to the feature sequence H_i to obtain trigger-word-oriented overall features, and inputting them into the classification layer of the trigger word detection task to obtain the final trigger word recognition result Y_td;
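The cross-task feature sharing of steps 4.5.2 and 4.5.3 reduces to the following shape-level sketch, with plain linear score layers standing in for the CRF classification layers (all matrices are random stand-ins, and the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d = 5, 16                   # sentence length, BiLSTM feature size
n_td, n_ner = 4, 6             # number of trigger tags / entity tags

H = rng.standard_normal((m, d))            # feature sequence H_i for one sentence

W_td  = rng.standard_normal((d, n_td))     # trigger classification layer (stand-in)
W_ner = rng.standard_normal((d, n_ner))    # entity classification layer (stand-in)
T_td  = rng.standard_normal((n_td, d))     # transform_td: tag space -> feature space
T_ner = rng.standard_normal((n_ner, d))    # transform_ner: tag space -> feature space

# Step 4.5.2: trigger output feeds the entity classifier.
O_td = H @ W_td                            # trigger-task output
F_td = O_td @ T_td                         # trigger feature information
Y_ner_scores = (H + F_td) @ W_ner          # entity scores enriched with trigger cues

# Step 4.5.3: entity output feeds the trigger classifier.
O_ner = H @ W_ner
F_ner = O_ner @ T_ner
Y_td_scores = (H + F_ner) @ W_td           # trigger scores enriched with entity cues
```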
step 5, training the MTL-TD-NER model to obtain the optimal trigger word detection and named entity recognition model:
step 5.1, setting the parameter variables of the model: the batch size is B, the current iteration number is epoch_now, the maximum number of iterations is epoch_max, the number of consecutive iterations in which the loss value of the model has not decreased is epoch_no, and the maximum such number for the early-stopping strategy is epoch_es;
step 5.2, parameter initialization:
initializing the parameters of the hybrid-coding-embedding word vector layer, the bidirectional-LSTM feature extraction layer, and the conditional-random-field classification layers by sampling from a uniform distribution;
step 5.3, starting from epoch_now, inputting batches of size B from the training data set S into the MTL-TD-NER model each time, and computing with formula (1) the loss between the labels output by the model and the correct labels in the training data set S, so as to update the parameters of the model:
loss = λ·loss_td + μ·loss_ner    (1)
in formula (1), loss_td and loss_ner are the loss functions of the trigger word detection task and the named entity recognition task:
loss_td = -score(y_td) + log Σ_{ỹ_td ∈ Y_td} exp(score(ỹ_td))    (2)
loss_ner = -score(y_ner) + log Σ_{ỹ_ner ∈ Y_ner} exp(score(ỹ_ner))    (3)
in formulas (2) and (3), y_td and y_ner are the correct trigger word tag sequence and named entity tag sequence; score(y_td) and score(y_ner) are respectively the scores of the trigger word tag sequence and the named entity tag sequence output by the MTL-TD-NER model for the i-th sentence sequence S_i; λ and μ are hyper-parameters that balance the importance of the two tasks; Y_td denotes the set of all possible trigger word tag sequences, Y_ner denotes the set of all possible entity tag sequences, ỹ_td denotes one trigger word tag sequence in Y_td, and ỹ_ner denotes one entity tag sequence in Y_ner;
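Formulas (2) and (3) have the form of the standard negative log-likelihood of a linear-chain CRF. A brute-force version, practical only for tiny tag sets (real CRF layers use the forward algorithm instead of enumeration), might look like this; all tensors and the λ, μ values are illustrative:

```python
import itertools
import numpy as np

def crf_nll(emissions, transitions, gold):
    """Negative log-likelihood of a linear-chain CRF, as in Eqs. (2)/(3):
    loss = -score(y) + log Σ_{ỹ} exp(score(ỹ)), enumerating every tag
    sequence ỹ by brute force for clarity."""
    m, n_tags = emissions.shape

    def score(seq):
        s = sum(emissions[t, tag] for t, tag in enumerate(seq))
        s += sum(transitions[a, b] for a, b in zip(seq, seq[1:]))
        return s

    log_z = np.log(sum(np.exp(score(seq))
                       for seq in itertools.product(range(n_tags), repeat=m)))
    return log_z - score(gold)

rng = np.random.default_rng(3)
loss_td = crf_nll(rng.standard_normal((4, 3)),   # 4 words, 3 trigger tags
                  rng.standard_normal((3, 3)),
                  gold=(0, 2, 1, 0))
loss_ner = crf_nll(rng.standard_normal((4, 5)),  # 4 words, 5 entity tags
                   rng.standard_normal((5, 5)),
                   gold=(1, 0, 3, 2))

lam, mu = 0.6, 0.4                 # illustrative values for the two weights
loss = lam * loss_td + mu * loss_ner   # combined multi-task loss, Eq. (1)
```

Because log Σ exp(score(ỹ)) always exceeds the score of any single sequence, each task loss is strictly positive.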
step 5.4, if epoch_now < epoch_max and epoch_no < epoch_es, incrementing epoch_now by 1 and returning to step 5.3; if epoch_now ≥ epoch_max or epoch_no = epoch_es, obtaining the optimal multi-task-learning-based trigger word detection and named entity recognition network model;
step 6, recognizing unlabeled data with the optimal trigger word detection and named entity recognition network model, so as to obtain the trigger word tag and the named entity tag of each word in the unlabeled data.
CN202110617440.1A 2021-05-31 2021-05-31 Biomedical trigger word detection and named entity identification method based on multi-task learning Active CN113360667B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110617440.1A CN113360667B (en) 2021-05-31 2021-05-31 Biomedical trigger word detection and named entity identification method based on multi-task learning


Publications (2)

Publication Number Publication Date
CN113360667A true CN113360667A (en) 2021-09-07
CN113360667B CN113360667B (en) 2022-07-26

Family

ID=77531522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110617440.1A Active CN113360667B (en) 2021-05-31 2021-05-31 Biomedical trigger word detection and named entity identification method based on multi-task learning

Country Status (1)

Country Link
CN (1) CN113360667B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113553853A (en) * 2021-09-16 2021-10-26 南方电网数字电网研究院有限公司 Named entity recognition method and device, computer equipment and storage medium
CN114580422A (en) * 2022-03-14 2022-06-03 昆明理工大学 Named entity identification method combining two-stage classification of neighbor analysis

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104965819A (en) * 2015-07-12 2015-10-07 大连理工大学 Biomedical event trigger word identification method based on syntactic word vector
CN105512209A (en) * 2015-11-28 2016-04-20 大连理工大学 Biomedicine event trigger word identification method based on characteristic automatic learning
CN108628970A (en) * 2018-04-17 2018-10-09 大连理工大学 A kind of biomedical event joint abstracting method based on new marking mode
CN111222318A (en) * 2019-11-19 2020-06-02 陈一飞 Trigger word recognition method based on two-channel bidirectional LSTM-CRF network
WO2020193966A1 (en) * 2019-03-26 2020-10-01 Benevolentai Technology Limited Name entity recognition with deep learning


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YAN WANG et al.: "Biomedical event trigger detection based on bidirectional LSTM and CRF", 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) *
YANSEN SU et al.: "EMODMI: A Multi-Objective Optimization Based Method to Identify Disease Modules", IEEE Transactions on Emerging Topics in Computational Intelligence *
HE Xinyu: "Research on Key Issues of Biomedical Event Extraction Based on Text Mining", China Doctoral Dissertations Full-text Database *
SU Yansen et al.: "Research on multi-objective path planning of underwater crawling robots", Journal of Hefei University of Technology (Natural Science) *


Also Published As

Publication number Publication date
CN113360667B (en) 2022-07-26

Similar Documents

Publication Publication Date Title
CN111444726B (en) Chinese semantic information extraction method and device based on long-short-term memory network of bidirectional lattice structure
CN110502749B (en) Text relation extraction method based on double-layer attention mechanism and bidirectional GRU
CN108984526B (en) Document theme vector extraction method based on deep learning
Yao et al. An improved LSTM structure for natural language processing
CN112115238B (en) Question-answering method and system based on BERT and knowledge base
CN112541356B (en) Method and system for recognizing biomedical named entities
CN108388560A (en) GRU-CRF meeting title recognition methods based on language model
CN114943230B (en) Method for linking entities in Chinese specific field by fusing common sense knowledge
CN111950283B (en) Chinese word segmentation and named entity recognition system for large-scale medical text mining
CN111738007A (en) Chinese named entity identification data enhancement algorithm based on sequence generation countermeasure network
CN113360667B (en) Biomedical trigger word detection and named entity identification method based on multi-task learning
Gao et al. Named entity recognition method of Chinese EMR based on BERT-BiLSTM-CRF
CN112784604A (en) Entity linking method based on entity boundary network
CN111914556A (en) Emotion guiding method and system based on emotion semantic transfer map
CN114676255A (en) Text processing method, device, equipment, storage medium and computer program product
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
Xu et al. Sentence segmentation for classical Chinese based on LSTM with radical embedding
CN113094502A (en) Multi-granularity takeaway user comment sentiment analysis method
CN112800184A (en) Short text comment emotion analysis method based on Target-Aspect-Opinion joint extraction
CN115545021A (en) Clinical term identification method and device based on deep learning
CN112101014A (en) Chinese chemical industry document word segmentation method based on mixed feature fusion
CN115169349A (en) Chinese electronic resume named entity recognition method based on ALBERT
Liu et al. Improved Chinese sentence semantic similarity calculation method based on multi-feature fusion
CN114564953A (en) Emotion target extraction model based on multiple word embedding fusion and attention mechanism
CN116522165B (en) Public opinion text matching system and method based on twin structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant