CN113360667B - Biomedical trigger word detection and named entity identification method based on multi-task learning - Google Patents
- Publication number: CN113360667B
- Application number: CN202110617440.1A
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F16/367 — Information retrieval of unstructured textual data; creation of semantic tools; Ontology
- G06F40/289 — Natural language analysis; phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295 — Named entity recognition
- G06N3/044 — Neural networks; recurrent networks, e.g. Hopfield networks
Abstract
The invention discloses a biomedical trigger word detection and named entity recognition method based on multi-task learning, which comprises the following steps: 1, preprocessing unstructured biomedical text with word segmentation and sentence segmentation techniques, and labeling the preprocessed text to generate a standard data set; 2, constructing a multi-task-learning-based neural network model for biomedical trigger word detection and named entity recognition; 3, training the neural network model and updating its parameters; and 4, predicting on unlabeled data with the trained optimal model, so as to identify the trigger words and named entities it contains. The method performs trigger word detection and named entity recognition on biomedical text simultaneously, thereby effectively improving recognition accuracy and reducing the demand on computing resources.
Description
Technical Field
The invention relates to the field of biomedical text mining, in particular to a biomedical trigger word detection and named entity identification method based on multi-task learning.
Background
A named entity is a specific noun or noun phrase in text that carries particularly important meaning. Named entity recognition can be divided into recognition of entities in the general domain and in specific domains. In the general domain, entities include organization names, person names, place names, and the like. In the biomedical domain, entities can be classified into cell entities, gene entities, protein entities, drug entities, disease entities, and the like. Compared with named entity recognition in the general domain, recognition in the biomedical domain is more difficult owing to entity nesting, word ambiguity, and similar phenomena. Accurate identification of biomedical entities facilitates the further development of information extraction and natural language processing techniques. In the biomedical field, named entity recognition can extract structured entity information from large amounts of unstructured literature, which promotes the construction of biomedical knowledge graphs and databases.
Current popular named entity recognition methods can mainly be divided into rule-based methods, traditional machine-learning methods, and deep-learning methods. Rule-based approaches rely primarily on manually formulated rules, including domain-specific gazetteers, syntactic-lexical patterns, and the like, to identify entities in text, and require no annotated data set. Traditional machine-learning methods rely mainly on manually designed linguistic features, such as prefix and suffix features, lexical features, and syntactic features, to train a traditional machine-learning algorithm to recognize named entities. In recent years, owing to the advantage of deep neural networks in automatically extracting the internal features of data, many deep-learning-based named entity recognition methods have emerged. However, most current methods can only perform a single entity recognition task in isolation and do not sufficiently extract the semantic features in the text, so their recognition performance remains limited.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art and provides a biomedical trigger word detection and named entity recognition method based on multi-task learning, so that trigger words and named entities in biomedical text can be detected and recognized simultaneously, effectively improving recognition accuracy while reducing the demand on computing resources.
In order to achieve the purpose, the invention adopts the technical scheme that:
The invention relates to a biomedical trigger word detection and named entity recognition method based on multi-task learning, which is characterized by comprising the following steps:
step 1, preprocessing unstructured biomedical texts:
performing word segmentation and sentence segmentation on the unstructured biomedical text to obtain a label-free training data set consisting of n sentence sequences, recorded as S = {S_1, S_2, …, S_i, …, S_n}; wherein S_i = {w_i^1, w_i^2, …, w_i^j, …, w_i^m} represents the ith sentence sequence, w_i^j = {c_1, c_2, …, c_k, …, c_K} represents the jth word sequence in the ith sentence sequence, and c_k represents the kth character of the jth word sequence w_i^j of the ith sentence sequence S_i; n represents the total number of sentences in the training data set, m represents the total number of words in one sentence, and K represents the total number of characters in one word;
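A minimal sketch of the preprocessing in step 1, assuming a simple regex tokenizer stands in for whatever biomedical tokenizer the patent's implementation actually uses:

```python
import re

def preprocess(text):
    """Split raw text into sentence sequences, each a list of word tokens."""
    # Naive sentence segmentation on ., !, ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Naive word segmentation: word characters or single punctuation marks.
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences if s]

# S = {S_1, ..., S_n}: the label-free training data set of step 1.
S = preprocess("IL-2 activates T cells. STAT5 binds DNA.")
print(S)
```

The output is a list of n sentence sequences S_i, each a list of m word sequences, ready for labeling in step 2.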
step 2, labeling the training data set S:
step 2.1, setting the trigger word categories and the named entity categories as L_td = {L_td^1, L_td^2, …} and L_ner = {L_ner^1, L_ner^2, …}; wherein L_td^n denotes the nth trigger word category and L_ner^n denotes the nth entity category;
step 2.2, labels of the trigger word category and the named entity category are simultaneously added to all word sequences of the ith sentence sequence S_i in the training data set S, so as to obtain a labeled training data set S_td of the trigger word detection task and a labeled training data set S_ner of the named entity recognition task; wherein (w_i^j, y_td^{i,j}) in S_td represents the jth word sequence w_i^j of the ith sentence sequence S_i together with its corresponding trigger word category y_td^{i,j}, and (w_i^j, y_ner^{i,j}) in S_ner represents the jth word sequence w_i^j of the ith sentence sequence S_i together with its corresponding entity category y_ner^{i,j};
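As an illustration of step 2.2, one sentence with two parallel label sequences; the BIO tagging scheme and the category names used here are common conventions assumed for the example, not fixed by the patent:

```python
# One sentence from S_i, labeled simultaneously for both tasks.
words      = ["IL-2", "activates", "T", "cells", "."]
td_labels  = ["O", "B-Positive_regulation", "O", "O", "O"]   # trigger word task
ner_labels = ["B-Protein", "O", "B-Cell", "I-Cell", "O"]     # entity task

# The labeled data sets S_td and S_ner pair each word with its category.
S_td  = list(zip(words, td_labels))
S_ner = list(zip(words, ner_labels))
print(S_td[1], S_ner[0])
```

Both label sequences are aligned word-for-word with the sentence, which is what allows the two tasks to share the encoding layers later.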
Step 3, word vector pre-training:
obtaining biomedical documents and performing word segmentation and sentence segmentation to obtain a label-free training data set consisting of n′ sentence sequences, recorded as S′ = {S′_1, S′_2, …, S′_{i′}, …, S′_{n′}}; S′_{i′} represents the i′th sentence sequence; the label-free training data set S′ is then trained with the Word2Vec tool based on a language model to obtain a pre-trained word vector matrix M;
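Word2Vec training itself is typically delegated to a library such as gensim; the sketch below (pure NumPy, with randomly initialized rows standing in for the vectors Word2Vec would learn) only shows the vocabulary-to-matrix lookup that the later layers rely on:

```python
import numpy as np

V = 8  # word vector dimension (the patent allows an arbitrary dimension V)
corpus = [["STAT5", "binds", "DNA"], ["IL-2", "activates", "STAT5"]]  # stands in for S'

# Build a vocabulary over the unlabeled corpus S'.
vocab = {w: i for i, w in enumerate(sorted({w for s in corpus for w in s}))}

# Pre-trained word vector matrix M: one row per vocabulary word.
# Random values stand in for vectors an actual Word2Vec run would produce.
rng = np.random.default_rng(0)
M = rng.normal(size=(len(vocab), V))

def word_vector(word):
    """Look up the word-level vector x for a word via its vocabulary index."""
    return M[vocab[word]]

x = word_vector("STAT5")
print(x.shape)
```

In step 4.1, converting a sentence to word-level vectors is exactly this lookup applied to every word.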
step 4, data processing of the MTL-TD-NER model, the multi-task-learning-based biomedical trigger word detection and named entity recognition neural network; the MTL-TD-NER model consists of a word vector coding layer based on hybrid coding embedding, a feature extraction layer based on bidirectional LSTM, and a classification layer based on conditional random fields;
step 4.1, the ith sentence sequence S_i in text form is converted, by means of the pre-trained word vector matrix M, into word-level word vectors of arbitrary dimension V, X_i = {x_i^1, x_i^2, …, x_i^j, …, x_i^m}; wherein x_i^j represents the word-level word vector of the jth word w_i^j;
step 4.2, data processing of the word vector coding layer based on hybrid coding embedding, wherein the word vector coding layer is composed of bidirectional long short-term memory network units;
step 4.2.1, each character of the jth word sequence w_i^j is input into a character-level bidirectional long short-term memory network unit, which is trained on the probability of all the characters generating the corresponding word;
step 4.2.2, the hidden-layer outputs of the first character and the last character of the jth word sequence w_i^j in the bidirectional long short-term memory network unit are extracted and concatenated as the character-level vector r_i^j of the jth word sequence w_i^j;
step 4.3, the word-level word vector x_i^j and the character-level word vector r_i^j of the jth word w_i^j are concatenated to obtain the hybrid-encoded word vector e_i^j = [x_i^j; r_i^j] of the jth word sequence w_i^j, thereby obtaining the word vectors E_i = {e_i^1, e_i^2, …, e_i^m} of the ith sentence sequence S_i;
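A dimensional sketch of steps 4.2–4.3, with a toy recurrent update standing in for a real LSTM cell (an assumption made only to keep the example self-contained; the point is the shapes: first/last hidden states concatenated into r_i^j, then appended to the word-level vector x_i^j):

```python
import numpy as np

V, C = 8, 4          # word-level dim V, character-level hidden dim C
rng = np.random.default_rng(1)

def char_level_vector(word):
    """Toy bidirectional pass over characters; a real model uses LSTM cells."""
    chars = rng.normal(size=(len(word), C))       # stand-in char embeddings
    fwd = np.zeros(C)
    for c in chars:                               # forward recurrence
        fwd = np.tanh(fwd + c)
    bwd = np.zeros(C)
    for c in chars[::-1]:                         # backward recurrence
        bwd = np.tanh(bwd + c)
    # Final hidden state of each direction, concatenated: r_i^j, shape (2C,).
    return np.concatenate([fwd, bwd])

x = rng.normal(size=V)              # word-level vector x_i^j (from matrix M)
r = char_level_vector("STAT5")      # character-level vector r_i^j
e = np.concatenate([x, r])          # hybrid-encoded word vector e_i^j
print(e.shape)
```

The hybrid vector has dimension V + 2C, combining corpus-level semantics with sub-word morphology.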
step 4.4, data processing of the feature extraction layer based on bidirectional LSTM:
the word vectors E_i = {e_i^1, …, e_i^m} of the ith sentence sequence S_i are input into a forward LSTM network and into a reverse LSTM network; finally, for the jth word vector e_i^j, the hidden-layer states →h_i^j and ←h_i^j output by the two LSTM networks are concatenated as the context feature information h_i^j of the word vector at the jth position, thereby obtaining the feature sequence H_i = {h_i^1, h_i^2, …, h_i^m} of the ith sentence sequence S_i;
step 4.5, data processing of the classification layer based on conditional random fields:
step 4.5.1, a classification layer for the trigger word detection task and a classification layer for the named entity recognition task are constructed, each taking a conditional random field as its basic unit; a parallel information conversion layer transform_td is constructed for the classification layer of the trigger word detection task, and a parallel information conversion layer transform_ner is constructed for the classification layer of the named entity recognition task;
step 4.5.2, the feature sequence H_i is input into the classification layer of the trigger word detection task to obtain the output O_td; O_td is then input into the information conversion layer transform_td to obtain the trigger word feature information F_td; finally, the feature information F_td is added to the feature sequence H_i to obtain the overall entity features, which are input into the classification layer of the named entity recognition task to obtain the final entity recognition result;
step 4.5.3, the feature sequence H_i is input into the classification layer of the named entity recognition task to obtain the output O_ner; O_ner is then input into the information conversion layer transform_ner to obtain the entity feature information F_ner; finally, the feature information F_ner is added to the feature sequence H_i to obtain the overall trigger word features, which are input into the classification layer of the trigger word detection task to obtain the final trigger word recognition result;
Step 5, training an MTL-TD-NER model to obtain an optimal trigger word detection and named entity recognition model:
step 5.1, setting parameter variables of the model:
the batch size is B; the current iteration number is epoch_now; the maximum number of iterations is epoch_max; the number of consecutive iterations in which the loss of the model does not decrease is epoch_no; the maximum number of such iterations for the early-stopping strategy is epoch_es;
Step 5.2, parameter initialization:
initializing each parameter of a word vector layer based on mixed coding embedding, a feature extraction layer based on bidirectional LSTM and a classification layer based on conditional random field by adopting a uniform distribution method;
step 5.3, starting from epoch_now, batches of size B from the training data set S are input into the MTL-TD-NER model each time, and the loss between the output labels of the model and the correct labels in the training data set S is calculated with formula (1), so as to update the parameters of the model:

loss = λ·loss_td + μ·loss_ner   (1)

in formula (1), loss_td and loss_ner are the loss functions of the trigger word detection and named entity recognition tasks:

loss_td = −score(y_td) + log Σ_{ỹ_td ∈ Y_td} exp(score(ỹ_td))   (2)

loss_ner = −score(y_ner) + log Σ_{ỹ_ner ∈ Y_ner} exp(score(ỹ_ner))   (3)

in formulas (2) and (3), y_td and y_ner are the trigger word label sequence and the named entity label sequence; score(y_td) and score(y_ner) are respectively the scores of the trigger word label sequence and of the named entity label sequence output by the MTL-TD-NER model for the input ith sentence sequence S_i; λ and μ are hyper-parameters used to balance the importance of the two tasks; Y_td represents the set of all possible trigger word label sequences, Y_ner represents the set of all possible entity label sequences, ỹ_td represents one trigger word label sequence in Y_td, and ỹ_ner represents one entity label sequence in Y_ner;
step 5.4, if epoch_now is less than epoch_max and epoch_no is less than epoch_es, epoch_now is incremented by 1 and step 5.3 is executed again; if epoch_now is greater than or equal to epoch_max, or epoch_no equals epoch_es, the optimal multi-task-learning-based trigger word detection and named entity recognition network model is obtained;
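The stopping logic of steps 5.3–5.4 can be sketched as a plain loop; the per-epoch losses below are dummy values chosen so that the plateau triggers the early-stopping branch before epoch_max is reached:

```python
epoch_now, epoch_max = 0, 100     # current and maximum iteration counts
epoch_no, epoch_es = 0, 15        # no-improvement counter and its limit
best_loss = float("inf")
# Dummy losses standing in for real training: four improvements, then a plateau.
losses = [3.0, 2.5, 2.1, 2.0] + [2.0] * 20

while epoch_now < epoch_max and epoch_no < epoch_es:
    loss = losses[min(epoch_now, len(losses) - 1)]
    if loss < best_loss:
        best_loss, epoch_no = loss, 0   # improvement: reset the counter
    else:
        epoch_no += 1                   # no improvement for one more epoch
    epoch_now += 1

print(epoch_now, epoch_no, best_loss)
```

With these values the loop stops after 19 epochs (4 improving epochs plus 15 flat ones), well before epoch_max, illustrating how epoch_es bounds wasted training.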
and 6, identifying the unlabeled data by using the optimal trigger word detection and named entity identification network model so as to obtain a trigger word label and a named entity identification label of each word in the unlabeled data.
Compared with the prior art, the invention has the beneficial effects that:
1. the method is different from the traditional named entity identification method based on rules and machine learning, realizes an end-to-end neural network model, avoids the manual design of various rules such as lexical and syntactic rules and the manual extraction of linguistic features, and simplifies the implementation of trigger word detection and named entity identification.
2. The invention designs a neural network model to simultaneously process the trigger word detection task and the named entity recognition task, and adopts a hard parameter sharing mode to enable the two tasks to share the same word vector coding layer based on mixed coding embedding and the feature extraction layer based on bidirectional LSTM, thereby accelerating the training process of the model and improving the operation efficiency of the model.
3. The invention utilizes an information conversion layer to convert mutually beneficial information between the trigger word and the named entity, can better mine useful characteristic information, and respectively inputs the useful characteristic information into the classification layers to help each other to better identify the trigger word and the named entity.
4. According to the method, the trigger detection task and the named entity recognition task are trained simultaneously under the multi-task learning framework, so that data enhancement can be performed implicitly, regularization is introduced, the risk of over-fitting is effectively avoided, and the recognition accuracy is improved.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
In this embodiment, a biomedical trigger word detection and named entity recognition method based on multi-task learning mainly uses a word vector coding layer based on hybrid embedding and a feature extraction layer based on bidirectional LSTM as a common part of two tasks, and then two classification layers based on conditional random fields are respectively constructed to simultaneously perform trigger word detection and named entity recognition, specifically as shown in fig. 1, the method is performed according to the following steps:
step 1, preprocessing unstructured biomedical texts:
performing word segmentation and sentence segmentation on the unstructured biomedical text to obtain a label-free training data set consisting of n sentence sequences, recorded as S = {S_1, S_2, …, S_i, …, S_n}; wherein S_i = {w_i^1, w_i^2, …, w_i^j, …, w_i^m} represents the ith sentence sequence, w_i^j = {c_1, c_2, …, c_k, …, c_K} represents the jth word sequence in the ith sentence sequence, and c_k represents the kth character of the jth word sequence w_i^j of the ith sentence sequence S_i; n represents the total number of sentences in the training data set, m represents the total number of words in one sentence, and K represents the total number of characters in one word;
step 2, labeling the training data set S:
step 2.1, setting the trigger word categories and the named entity categories respectively as L_td = {L_td^1, L_td^2, …} and L_ner = {L_ner^1, L_ner^2, …}; wherein L_td^n denotes the nth trigger word category and L_ner^n denotes the nth entity category;
step 2.2, labels of the trigger word category and the named entity category are simultaneously added to all word sequences of the ith sentence sequence S_i in the training data set S, so as to obtain a labeled training data set S_td of the trigger word detection task and a labeled training data set S_ner of the named entity recognition task; wherein (w_i^j, y_td^{i,j}) in S_td represents the jth word sequence w_i^j of the ith sentence sequence S_i together with its corresponding trigger word category y_td^{i,j}, and (w_i^j, y_ner^{i,j}) in S_ner represents the jth word sequence w_i^j of the ith sentence sequence S_i together with its corresponding entity category y_ner^{i,j};
Step 3, word vector pre-training:
in order to make the word-level word vectors contain a large amount of linguistic information, a large number of biomedical documents are downloaded from the PubMed database, and word segmentation and sentence segmentation are performed to obtain a label-free training data set consisting of n′ sentence sequences, recorded as S′ = {S′_1, S′_2, …, S′_{i′}, …, S′_{n′}}; S′_{i′} represents the i′th sentence sequence; S′ is then trained with the Word2Vec tool based on a language model to obtain a pre-trained word vector matrix M;
step 4, data processing of the MTL-TD-NER model, the multi-task-learning-based biomedical trigger word detection and named entity recognition neural network; the MTL-TD-NER model consists of a word vector coding layer based on hybrid coding embedding, a feature extraction layer based on bidirectional LSTM, and a classification layer based on conditional random fields;
step 4.1, the ith sentence sequence S_i in text form is converted, by means of the pre-trained word vector matrix M, into word-level word vectors of arbitrary dimension V, X_i = {x_i^1, x_i^2, …, x_i^j, …, x_i^m}; wherein x_i^j represents the word-level word vector of the jth word w_i^j;
Step 4.2, processing data based on a word vector coding layer embedded by mixed coding, wherein the word vector coding layer is composed of bidirectional long-term and short-term memory network units;
step 4.2.1, in order to obtain character level characteristic information of words, the jth word sequence is processedEach character in the character string is input into a bidirectional long and short term memory network unit at the character level and used for training the generation probability of generating corresponding words by all the characters;
step 4.2.2, extracting the jth word sequenceThe first character and the last character in the bidirectional long-short term memory network unit are connected as the jth word sequence after being output on the hidden layer in the bidirectional long-short term memory network unitCharacter level vector of
step 4.3, the word-level word vector x_i^j and the character-level word vector r_i^j of the jth word w_i^j are concatenated to obtain the hybrid-encoded word vector e_i^j = [x_i^j; r_i^j] of the jth word sequence w_i^j, thereby obtaining the word vectors E_i = {e_i^1, e_i^2, …, e_i^m} of the ith sentence sequence S_i;
step 4.4, data processing of the feature extraction layer based on bidirectional LSTM:
in order to obtain the feature information of the whole context, the word vectors E_i = {e_i^1, …, e_i^m} of the ith sentence sequence S_i are input into a forward LSTM network and into a reverse LSTM network; finally, for the jth word vector e_i^j, the hidden-layer states →h_i^j and ←h_i^j output by the two LSTM networks are concatenated as the context feature information h_i^j of the word vector at the jth position, thereby obtaining the feature sequence H_i = {h_i^1, h_i^2, …, h_i^m} of the ith sentence sequence S_i;
step 4.5, data processing of the classification layer based on conditional random fields:
step 4.5.1, a classification layer for the trigger word detection task and a classification layer for the named entity recognition task are constructed, each taking a conditional random field as its basic unit, so that the classification layer can handle label dependence well; meanwhile, considering that trigger words and entities are correlated and mutually reinforcing, a parallel information conversion layer transform_td is constructed for the classification layer of the trigger word detection task, and a parallel information conversion layer transform_ner is constructed for the classification layer of the named entity recognition task;
step 4.5.2, the feature sequence H_i is input into the classification layer of the trigger word detection task to obtain the output O_td; O_td is then input into the information conversion layer transform_td to obtain the trigger word feature information F_td; finally, the feature information F_td is added to the feature sequence H_i to obtain the overall entity features, which are input into the classification layer of the named entity recognition task to obtain the final entity recognition result;
step 4.5.3, the feature sequence H_i is input into the classification layer of the named entity recognition task to obtain the output O_ner; O_ner is then input into the information conversion layer transform_ner to obtain the entity feature information F_ner; finally, the feature information F_ner is added to the feature sequence H_i to obtain the overall trigger word features, which are input into the classification layer of the trigger word detection task to obtain the final trigger word recognition result;
Step 5, training an MTL-TD-NER model so as to obtain an optimal trigger word detection and named entity recognition model:
step 5.1, setting parameter variables of the model:
setting the batch size B to 50; the current iteration number epoch_now starts at 0; the maximum number of iterations epoch_max is 100; the counter epoch_no of consecutive iterations in which the model loss does not decrease starts at 0; and the maximum number of such iterations for the early-stopping strategy, epoch_es, is 15;
step 5.2, parameter initialization:
initializing each parameter of a word vector layer based on mixed coding embedding, a feature extraction layer based on bidirectional LSTM and a classification layer based on conditional random field by adopting a uniform distribution method;
step 5.3, starting from epoch_now, batches of size B from the training data set S are input into the MTL-TD-NER model each time, and the loss between the output labels of the model and the correct labels in the training data set S is calculated with formula (1), so as to update the parameters of the model:

loss = λ·loss_td + μ·loss_ner   (1)

in formula (1), loss_td and loss_ner are the loss functions of the trigger word detection and named entity recognition tasks:

loss_td = −score(y_td) + log Σ_{ỹ_td ∈ Y_td} exp(score(ỹ_td))   (2)

loss_ner = −score(y_ner) + log Σ_{ỹ_ner ∈ Y_ner} exp(score(ỹ_ner))   (3)

in formulas (2) and (3), y_td and y_ner are the trigger word label sequence and the named entity label sequence; score(y_td) and score(y_ner) are respectively the scores of the trigger word label sequence and of the named entity label sequence output by the MTL-TD-NER model for the input ith sentence sequence S_i; λ and μ are hyper-parameters used to balance the importance of the two tasks, both set to 1 here; Y_td represents the set of all possible trigger word label sequences, Y_ner represents the set of all possible entity label sequences, ỹ_td represents one trigger word label sequence in Y_td, and ỹ_ner represents one entity label sequence in Y_ner;
step 5.4, if epoch_now is less than epoch_max and epoch_no is less than epoch_es, epoch_now is incremented by 1 and step 5.3 is executed again; if epoch_now is greater than or equal to epoch_max, or epoch_no equals epoch_es, the optimal multi-task-learning-based trigger word detection and named entity recognition network model is obtained;
and 6, identifying the unlabeled data by using the optimal trigger word detection and named entity identification network model so as to obtain the trigger word label and the named entity identification label of each word in the unlabeled data.
The above embodiment provides a biomedical trigger word detection and named entity recognition method based on multi-task learning. Instead of using two different models to perform trigger word detection and named entity recognition separately, a multi-task learning framework is designed to perform the two tasks simultaneously. The MTL-TD-NER model was tested on a data set to verify the effectiveness of the multi-task learning framework, and the multi-task framework shows clear advantages in both trigger word detection and named entity recognition.
Claims (1)
1. A biomedical trigger word detection and named entity recognition method based on multitask learning is characterized by comprising the following steps:
step 1, preprocessing an unstructured biomedical text:
performing word segmentation and sentence segmentation on the unstructured biomedical text to obtain a label-free training data set consisting of n sentence sequences, recorded as S = {S_1, S_2, …, S_i, …, S_n}; wherein S_i = {w_i^1, w_i^2, …, w_i^j, …, w_i^m} represents the ith sentence sequence, w_i^j = {c_1, c_2, …, c_k, …, c_K} represents the jth word sequence in the ith sentence sequence, and c_k represents the kth character of the jth word sequence w_i^j of the ith sentence sequence S_i; n represents the total number of sentences in the training data set, m represents the total number of words in one sentence, and K represents the total number of characters in one word;
step 2, labeling the training data set S:
step 2.1, setting the trigger word categories and the named entity categories as L_td = {L_td^1, L_td^2, …} and L_ner = {L_ner^1, L_ner^2, …}; wherein L_td^n denotes the nth trigger word category and L_ner^n denotes the nth entity category;
step 2.2, labels of the trigger word category and the named entity category are simultaneously added to all word sequences of the ith sentence sequence S_i in the training data set S, so as to obtain a labeled training data set S_td of the trigger word detection task and a labeled training data set S_ner of the named entity recognition task; wherein (w_i^j, y_td^{i,j}) in S_td represents the jth word sequence w_i^j of the ith sentence sequence S_i together with its corresponding trigger word category y_td^{i,j}, and (w_i^j, y_ner^{i,j}) in S_ner represents the jth word sequence w_i^j of the ith sentence sequence S_i together with its corresponding entity category y_ner^{i,j};
Step 3, word vector pre-training:
obtaining biomedical documents and performing word segmentation and sentence segmentation to obtain a label-free training data set consisting of n′ sentence sequences, recorded as S′ = {S′_1, S′_2, …, S′_{i′}, …, S′_{n′}}; S′_{i′} represents the i′th sentence sequence; the label-free training data set S′ is then trained with the Word2Vec tool based on a language model to obtain a pre-trained word vector matrix M;
step 4, data processing of the MTL-TD-NER model, a neural network for biomedical trigger word detection and named entity recognition based on multi-task learning; the MTL-TD-NER model consists of a word vector encoding layer based on hybrid-encoding embedding, a feature extraction layer based on bidirectional LSTM, and a classification layer based on conditional random fields;
step 4.1, converting the i-th sentence sequence S_i in text form into word-level word vector data of an arbitrary dimension V using the pre-trained word vector matrix M, recorded as X_i = {x_i^1, x_i^2, ..., x_i^m}; wherein x_i^j represents the word-level word vector of the j-th word w_i^j;
step 4.2, data processing of the word vector encoding layer based on hybrid-encoding embedding, wherein the word vector encoding layer consists of bidirectional long short-term memory network units;
step 4.2.1, inputting each character of the j-th word sequence w_i^j into a character-level bidirectional long short-term memory network unit, which is trained on the probability of all the characters generating the corresponding word;
step 4.2.2, extracting the hidden-layer outputs of the first character and the last character of the j-th word sequence w_i^j from the bidirectional long short-term memory network unit and concatenating them as the character-level vector v_i^j of the j-th word sequence w_i^j;
step 4.3, concatenating the word-level word vector x_i^j and the character-level vector v_i^j of the j-th word w_i^j to obtain the hybrid-encoded word vector e_i^j of the j-th word sequence w_i^j, thereby obtaining the word vector sequence E_i = {e_i^1, e_i^2, ..., e_i^m} of the i-th sentence sequence S_i;
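The hybrid encoding of steps 4.1–4.3 concatenates a V-dimensional word-level vector with the first- and last-character hidden states of a character-level BiLSTM. A dimension-bookkeeping sketch — the dimensions V and d are illustrative, and the BiLSTM is replaced by placeholder hidden-state vectors:

```python
V, d = 4, 3  # word-vector dimension and char-LSTM hidden size (illustrative)

word_vec = [0.1] * V             # x_i^j: word-level vector looked up in M
h_first = [0.2] * d              # hidden state at the first character
h_last = [0.3] * d               # hidden state at the last character

char_vec = h_first + h_last      # v_i^j: character-level vector, length 2d
mixed_vec = word_vec + char_vec  # e_i^j: hybrid-encoded vector, length V + 2d
```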
step 4.4, data processing of the feature extraction layer based on bidirectional LSTM:
inputting the word vector sequence E_i of the i-th sentence sequence S_i into a forward LSTM network structure and into a layer of backward LSTM network structure; finally, the hidden-layer states output by the two LSTM networks for the j-th word vector e_i^j are concatenated together as the context feature information h_i^j of the word vector at the j-th position, thereby obtaining the feature sequence H_i = {h_i^1, h_i^2, ..., h_i^m} of the i-th sentence sequence S_i;
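Step 4.4 runs the sentence through forward and backward recurrent passes and concatenates the two hidden states at each position. A toy sketch in which a simple tanh recurrence stands in for the LSTM cell — the real model uses gated LSTM units; this only illustrates the bidirectional concatenation:

```python
import math

def rnn_pass(xs, reverse=False):
    """One recurrent pass; h_t = tanh(x_t + h_{t-1}) stands in for an LSTM cell."""
    seq = list(reversed(xs)) if reverse else xs
    h, hs = 0.0, []
    for x in seq:
        h = math.tanh(x + h)
        hs.append(h)
    # Re-align the backward pass so hs[j] corresponds to position j.
    return list(reversed(hs)) if reverse else hs

E = [0.5, -0.2, 0.9]                    # scalar stand-ins for e_i^1..e_i^m
fwd = rnn_pass(E)                       # forward hidden states
bwd = rnn_pass(E, reverse=True)         # backward hidden states
H = [(f, b) for f, b in zip(fwd, bwd)]  # h_i^j = concat(forward, backward)
```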
step 4.5, data processing of the classification layer based on conditional random fields:
step 4.5.1, constructing a classification layer for the trigger word detection task and a classification layer for the named entity recognition task, each taking a conditional random field as its basic unit; building a parallel information transformation layer transform_td for the classification layer of the trigger word detection task, and building a parallel information transformation layer transform_ner for the classification layer of the named entity recognition task;
step 4.5.2, inputting the feature sequence H_i into the classification layer of the trigger word detection task to obtain an output O_td; then inputting the output O_td into the information transformation layer transform_td to obtain trigger word feature information F_td; finally, adding the feature information F_td to the feature sequence H_i to obtain overall entity features, which are input into the classification layer of the named entity recognition task to obtain the final entity recognition result;
step 4.5.3, inputting the feature sequence H_i into the classification layer of the named entity recognition task to obtain an output O_ner; then inputting the output O_ner into the information transformation layer transform_ner to obtain entity feature information F_ner; finally, adding the feature information F_ner to the feature sequence H_i to obtain overall trigger word features, which are input into the classification layer of the trigger word detection task to obtain the final trigger word recognition result;
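Each classification layer in step 4.5 is a conditional random field, which decodes the best label sequence with the Viterbi algorithm. A minimal stdlib Viterbi sketch over illustrative emission and transition scores — the numbers are made up; the patent's CRF learns these scores during training:

```python
def viterbi(emissions, transitions, n_tags):
    """Return the highest-scoring tag sequence.

    emissions[t][y]      score of tag y at position t
    transitions[y0][y1]  score of moving from tag y0 to tag y1
    """
    score = list(emissions[0])
    back = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for y in range(n_tags):
            best_prev = max(range(n_tags),
                            key=lambda y0: score[y0] + transitions[y0][y])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][y]
                             + emissions[t][y])
        score = new_score
        back.append(ptr)
    # Follow back-pointers from the best final tag.
    y = max(range(n_tags), key=lambda k: score[k])
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return list(reversed(path))

# Tags: 0 = O, 1 = B-Trigger (illustrative). Transitions discourage 1 -> 1.
em = [[2.0, 0.5], [0.1, 1.5], [1.0, 0.2]]
tr = [[0.5, 0.0], [0.0, -1.0]]
path = viterbi(em, tr, n_tags=2)
```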
step 5, training the MTL-TD-NER model to obtain the optimal trigger word detection and named entity recognition model:
step 5.1, setting parameter variables of the model:
the batch size is B, the current iteration number is epoch_now, the maximum number of iterations is epoch_max, the number of consecutive iterations during which the loss of the model does not decrease is epoch_no, and the maximum number of such iterations tolerated by the early-stopping strategy is epoch_es;
Step 5.2, parameter initialization:
initializing each parameter of the word vector layer based on hybrid-encoding embedding, the feature extraction layer based on bidirectional LSTM, and the classification layer based on conditional random fields by sampling from a uniform distribution;
step 5.3, from the epoch_now-th iteration onward, inputting batches of size B from the training data set S into the MTL-TD-NER model and computing the loss between the labels output by the model and the correct labels in the training data set S with formula (1), so as to update the parameters of the model:

loss = λ · loss_td + μ · loss_ner   (1)
in formula (1), loss_td and loss_ner are the loss functions of the trigger word detection and named entity recognition tasks:

loss_td = −score(y_td) + log Σ_{ŷ_td ∈ Y_td} exp(score(ŷ_td))   (2)

loss_ner = −score(y_ner) + log Σ_{ŷ_ner ∈ Y_ner} exp(score(ŷ_ner))   (3)

in formulas (2) and (3), y_td and y_ner are the trigger word label sequence and the named entity label sequence; score(y_td) and score(y_ner) are respectively the scores of the trigger word label sequence and the named entity label sequence output by the MTL-TD-NER model when the i-th sentence sequence S_i is input; λ and μ are hyper-parameters for balancing the importance of the two tasks; Y_td represents the set of all possible trigger word label sequences, Y_ner represents the set of all possible entity label sequences, ŷ_td represents one trigger word label sequence in Y_td, and ŷ_ner represents one entity label sequence in Y_ner;
step 5.4, if epoch_now is less than epoch_max and epoch_no is less than epoch_es, incrementing epoch_now by 1 and returning to step 5.3; if epoch_now is greater than or equal to epoch_max, or epoch_no equals epoch_es, obtaining the optimal trigger word detection and named entity recognition network model based on multi-task learning;
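The stopping logic of steps 5.3–5.4 combines a hard iteration cap with early stopping on a non-decreasing loss. A stdlib sketch of that control flow — the loss schedule is illustrative; a real run would compute the per-batch multi-task loss and update the model's parameters each epoch:

```python
def train(losses_per_epoch, epoch_max, epoch_es):
    """Run epochs until the cap epoch_max or epoch_es epochs without improvement."""
    best_loss = float("inf")
    epoch_no = 0  # consecutive epochs in which the loss did not decrease
    for epoch_now, loss in enumerate(losses_per_epoch, start=1):
        if loss < best_loss:
            best_loss, epoch_no = loss, 0
        else:
            epoch_no += 1
        if epoch_now >= epoch_max or epoch_no == epoch_es:
            return epoch_now  # stop: cap reached or early stopping fired
    return len(losses_per_epoch)

# Illustrative loss schedule: improvement stalls after epoch 3.
stopped = train([2.0, 1.5, 1.2, 1.3, 1.3, 1.25], epoch_max=100, epoch_es=2)
```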
step 6, recognizing unlabeled data with the optimal trigger word detection and named entity recognition network model, so as to obtain the trigger word label and the named entity label of each word in the unlabeled data.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110617440.1A CN113360667B (en) | 2021-05-31 | 2021-05-31 | Biomedical trigger word detection and named entity identification method based on multi-task learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113360667A CN113360667A (en) | 2021-09-07 |
CN113360667B true CN113360667B (en) | 2022-07-26 |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||